last year occurrence from string - regex

I have strings like this:
ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar
I'm trying to get the last occurrence of a single year (from 1900 to 2050), so I need to extract only 1934 from that string.
I'm trying with:
grep -P -o '\s(19|20)[0-9]{2}\s(?!\s(19|20)[0-9]{2}\s)'
or
grep -P -o '((19|20)[0-9]{2})(?!\s\1\s)'
But it matches: 1910 and 1934
Here's the Regex101 example:
https://regex101.com/r/UetMl0/3
https://regex101.com/r/UetMl0/4
Plus: how can I extract the year without the surrounding spaces without doing an extra grep to filter them?

Have you ever heard this saying:
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
Keep it simple - you're interested in finding a number between 2 numbers so just use a numeric comparison, not a regexp:
$ awk -v min=1900 -v max=2050 '{yr=""; for (i=1;i<=NF;i++) if ( ($i ~ /^[0-9]{4}$/) && ($i >= min) && ($i <= max) ) yr=$i; print yr}' file
1934
You didn't say what to do if no date within your range is present so the above outputs a blank line if that happens but is easily tweaked to do anything else.
To change the above script to find the first instead of the last date is trivial (move the print inside the if), to use different start or end dates in your range is trivial (change the min and/or max values), etc., etc. which is a strong indication that this is the right approach. Try changing any of those requirements with a regexp-based solution.
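The same numeric-comparison idea can be sketched in Python. This is only an illustration of the awk logic above, not part of the original answer; the function name `last_year` and its `lo`/`hi` parameters are made up for the example.

```python
import re

def last_year(line, lo=1900, hi=2050):
    """Return the last whitespace-separated 4-digit field within [lo, hi], or ""."""
    found = ""
    for field in line.split():
        # Same test as the awk script: exactly four digits, then a numeric range check.
        if re.fullmatch(r"[0-9]{4}", field) and lo <= int(field) <= hi:
            found = field  # keep overwriting so the last hit wins
    return found

print(last_year("ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"))  # 1934
```

As with the awk version, returning from inside the `if` would give the first year instead of the last.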

I don't see a way to do this with grep because it doesn't let you output just one of the capture groups, only the whole match.
With perl I'd do something like
perl -lne 'if (/^.*\b(19\d\d|20(?:[0-4]\d|50))\b/) { print $1 }'
Idea: Use ^.* (greedy) to consume as much of the string up front as possible, thus finding the last possible match. Use \b (word boundary) around the matched number to prevent matching 01900 or X1911D. Only print the first capture group ($1).
I tried to implement your requirement of 1900-2050; if that's too complicated, ((?:19|20)\d\d) will do (but also match e.g. 2099).
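For comparison, here is a minimal Python sketch of the same greedy-prefix idea. It assumes the standard `re` module; the helper name `last_year` is hypothetical.

```python
import re

# Greedy ^.* swallows as much as possible, so the capture group ends up
# on the last year-like token that is delimited by word boundaries.
LAST = re.compile(r"^.*\b(19\d\d|20(?:[0-4]\d|50))\b")

def last_year(line):
    m = LAST.search(line)
    return m.group(1) if m else None

print(last_year("ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"))  # 1934
```

Note that \b treats the hyphen as a boundary, so in a line whose last candidate is 1955-2011 this would return 2011.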

The regex to do your task using grep can be as follows:
\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)
Details:
\b - Word boundary.
(?: - Start of a non-capturing group, needed as a container for
alternatives.
19\d{2}| - The first alternative (1900 - 1999).
20[0-4]\d| - The second alternative (2000 - 2049).
2050 - The third alternative, just 2050.
) - End of the non-capturing group.
\b - Word boundary.
(?! - Negative lookahead for:
.* - A sequence of any chars, meaning actually "what follows
can occur anywhere further".
\b(?:19\d{2}|20[0-4]\d|2050)\b - The same expression as before.
) - End of the negative lookahead.
The word boundary anchors ensure that you will not match numbers that are
parts of longer words, e.g. X1911D.
The negative lookahead ensures that you will match just the last
occurrence of the required year.
If your tool supports a call to a previous
numbered group, (?n), where n is the number of another capturing
group, the regex can be a bit simpler:
(\b(?:19\d{2}|20[0-4]\d|2050)\b)(?!.*(?1))
Details:
(\b(?:19\d{2}|20[0-4]\d|2050)\b) - The regex like before, but
enclosed within a capturing group (it will be "called" later).
(?!.*(?1)) - Negative lookahead for capturing group No 1,
located anywhere further.
This way you avoid writing the same expression again.
For a working example in regex101 see https://regex101.com/r/fvVnZl/1
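A rough Python equivalent of the negative-lookahead approach is sketched below. Python's standard `re` module has no (?1) group call, so this sketch avoids writing the year expression twice by building the pattern from a string constant; `YEAR` and `last_year` are names invented for the example.

```python
import re

# The year expression from the answer, written once and reused.
YEAR = r"\b(?:19\d{2}|20[0-4]\d|2050)\b"

# A year that is not followed, anywhere later on the line, by another year.
LAST_YEAR = re.compile(YEAR + r"(?!.*" + YEAR + r")")

def last_year(line):
    m = LAST_YEAR.search(line)
    return m.group(0) if m else None

print(last_year("ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"))  # 1934
```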

You may use a PCRE regex without any groups to only return the last occurrence of a pattern you need if you prepend the pattern with ^.*\K, or, in your case, since you expect a whitespace boundary, ^(?:.*\s)?\K:
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' file
See the regex demo.
Details
^ - start of line
(?:.*\s)? - an optional non-capturing group matching 1 or 0 occurrences of
.* - any 0+ chars other than line break chars, as many as possible
\s - a whitespace char
\K - match reset operator discarding the text matched so far
(?:19\d{2}|20(?:[0-4]\d|50)) - 19 and any two digits or 20 followed with either a digit from 0 to 4 and then any digit (00 to 49) or 50.
(?!\S) - a negative lookahead that requires a whitespace char or the end of string immediately to the right.
See an online demo:
s="ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' <<< "$s"
# => 1934
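Python's standard `re` module does not support \K, so a sketch of the same idea there has to use a capturing group instead. The pattern is otherwise the one from the answer; the function name `last_year` is an assumption for illustration.

```python
import re

# Same structure as the grep pattern, with \K replaced by a capture group.
LAST = re.compile(r"^(?:.*\s)?((?:19\d{2}|20(?:[0-4]\d|50)))(?!\S)")

def last_year(line):
    m = LAST.match(line)
    return m.group(1) if m else None

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year(s))  # 1934
```

Because the boundaries here are whitespace rather than \b, the years inside 1955-2011 are not candidates.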

Related

Regex (PCRE): Match all digits in a line following a line which includes a certain string

Using PCRE, I want to capture only and all digits in a line which follows a line in which a certain string appears. Say the string is "STRING99". Example:
car string99 house 45b
22 dog 1 cat
women 6 man
In this case, the desired result is:
221
I asked a similar question some time ago; however, back then I was trying to capture the numbers in the SAME line where the string appears ( Regex (PCRE): Match all digits conditional upon presence of a string ). While the question is similar, I don't think the answer, if there is one at all, will be similar. The approach using the newline anchor ^ does not work in this case.
I am looking for a single regular expression without any other programming code. It would be easy to accomplish with two consecutive regex operations, but this is not what I'm looking for.
Maybe you could try:
(?:\bstring99\b.*?\n|\G(?!^))[^\d\n]*\K\d
See the online demo
(?: - Open non-capture group:
\bstring99\b - Literally match "string99" between word-boundaries.
.*?\n - Lazy match up to (including) nearest newline character.
| - Or:
\G(?!^) - Asserts the position at the end of the previous match; the negative lookahead prevents \G from matching at the start of the string on the first attempt.
) - Close non-capture group.
[^\d\n]* - Match 0+ non-digit/newline characters.
\K - Resets the starting point of the reported match.
\d - Match a digit.
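To check what the pattern is expected to return, the logic can be spelled out in plain Python. This is deliberately not a single regex (Python's standard `re` module supports neither \G nor \K), so it is only a two-step reference sketch of the same result; the function name and the `marker` parameter are hypothetical.

```python
import re

def digits_after_marker(text, marker="string99"):
    """Collect the digits of every line that directly follows a line containing marker."""
    marker_re = re.compile(r"\b" + re.escape(marker) + r"\b")
    lines = text.split("\n")
    collected = []
    # Pair each line with the one after it.
    for previous, current in zip(lines, lines[1:]):
        if marker_re.search(previous):
            collected.extend(re.findall(r"\d", current))
    return "".join(collected)

sample = "car string99 house 45b\n22 dog 1 cat\nwomen 6 man"
print(digits_after_marker(sample))  # 221
```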

RegExp checking for sign only if there is text afterwards

I have some cases, which I need to filter with a regex. The values which need to be filtered are listed below:
// These should be caught
123456_Test.pdf
123456 Test.pdf
123456.pdf
// These shouldn't be caught
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
The current regEx looks like this:
(\d{6,7})((\_| ){0,1})(.*)\..*
The problem here is that the latter 3 are also matched. To give you a short overview of what's wrong with the 1st "wrongly" matched string:
The 1st capture group has to consist of 6-7 digits (the capture group is also needed in the end). If there are letters after these digits, there has to be a whitespace or underscore in between. The 1st of the entries that should not match shows this: it is invalid, since there are letters after 123456 without the needed sign.
The last entry isn't really important, it is just there for convenience.
What am I missing? How do I adjust my regex so that it checks for the sign only if there are letters following the number chain?
You may use
^(\d{6,7})([_ ][A-Za-z].*)?\..*$
See the regex demo
Details
^ - start of a string
(\d{6,7}) - Group 1: 6 or 7 digits
([_ ][A-Za-z].*)? - an optional capturing group #2: a _ or space followed with a letter and then any 0+ chars as many as possible, up to the last
\. - . on a line
.* - the rest of the line
$ - end of string.
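A quick way to check the pattern against the six sample names is a small Python script like the following sketch (the helper name `is_valid` is made up for the example):

```python
import re

FILENAME = re.compile(r"^(\d{6,7})([_ ][A-Za-z].*)?\..*$")

def is_valid(name):
    """True if name starts with 6-7 digits, optionally followed by [_ ] + letter + text."""
    return FILENAME.match(name) is not None

samples = [
    "123456_Test.pdf", "123456 Test.pdf", "123456.pdf",        # should match
    "123456Abcasd.pdf", "123456-Abcasd.pdf", "123456_.pdf",    # should not match
]
for name in samples:
    print(name, is_valid(name))
```

Group 1 still holds the digits, so `FILENAME.match(name).group(1)` gives 123456 for the valid names.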
Check if this perl solution works for you.
> cat regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
> perl -ne ' print if m/\d+(([ _])[a-zA-Z]+| [a-zA-Z]*)?\.pdf/ ' regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
>

Looking for regex to match before and after a number

Given the string
170905-CBM-238.pdf
I'm trying to match 170905-CBM and .pdf so that I can replace/remove them and be left with 238.
I've searched and found pieces that work but can't put it all together.
This-> (.*-) will match the first section and
This-> (.[^/.]+$) will match the last section
But I can't figure out how to tie them together so that it matches everything before, including the second dash and everything after, including the period (or the extension) but does not match the numbers between.
help :) and thank you for your kind consideration.
There are several options to achieve what you need in Nintex.
If you use Extract operation, use (?<=^.*-)\d+(?=\.[^.]*$) as Pattern.
See the regex demo.
Details
(?<=^.*-) - a positive lookbehind requiring, immediately to the left of the current location, the start of string (^), then any 0+ chars other than LF as many as possible up to the last occurrence of - and the subsequent subpatterns
\d+ - 1 or more digits
(?=\.[^.]*$) - a positive lookahead requiring, immediately to the right of the current location, the presence of a . and 0+ chars other than . up to the end of the string.
If you use Replace text operation, use
Pattern: ^.*-([0-9]+)\.[^.]+$
Replacement text: $1
See another regex demo (the Context tab shows the result of the replacement).
Details
^ - a start of string anchor
.* - any 0+ chars other than LF up to the last occurrence of the subsequent subpatterns...
- - a hyphen
([0-9]+) - Group 1: one or more ASCII digits
\. - a literal .
[^.]+ - 1 or more chars other than .
$ - end of string.
The replacement $1 references the value stored in Group 1.
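The replace variant carries over to other engines almost unchanged. As a sketch, here it is with Python's `re.sub`, where the replacement is written `\1` instead of `$1`; the function name `middle_number` is an assumption.

```python
import re

def middle_number(filename):
    # Everything up to the last hyphen and the extension are dropped;
    # only the digits captured in Group 1 are kept.
    return re.sub(r"^.*-([0-9]+)\.[^.]+$", r"\1", filename)

print(middle_number("170905-CBM-238.pdf"))  # 238
```

A name that does not fit the pattern is returned unchanged, which may or may not be what you want.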
I don't know Nintex regex, but here is a sed-type regex:
$ echo "170905-CBM-238.pdf" | sed -E 's/^.*-([0-9]*)\.[^.]*$/\1/'
238
Same works in Perl:
$ echo "170905-CBM-238.pdf" | perl -pe 's/^.*-([0-9]*)\.[^.]*$/$1/'
238

RegEx skip word

I would like to use regular expressions to extract the first couple of words and the second to last letter of a string.
For example, in the string
"CSC 101 Intro to Computing A R"
I would like to capture
"CSC 101 A"
Maybe something similar to this
grep -o -P '\w{3}\s\d{3}*thenIdon'tKnow*\s\w\s'
Any help would be greatly appreciated.
You could go for:
^((?:\w+\W+){2}).*(\w+)\W+\w+$
And use group 1 + 2, see it working on regex101.com.
Broken down, this says:
^ # match the start of the line/string
( # capture group 1
(?:\w+\W+){2} # repeated non-capturing group with words/non words
)
.* # anything else afterwards
(\w+)\W+\w+ # backtracking to the second last word character
$
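To show how "group 1 + 2" is put together, here is a small Python sketch using the same expression (the function name `course_and_flag` is invented for the example):

```python
import re

PATTERN = re.compile(r"^((?:\w+\W+){2}).*(\w+)\W+\w+$")

def course_and_flag(line):
    m = PATTERN.match(line)
    if m is None:
        return None
    # Group 1 already ends with its separator, so plain concatenation is enough.
    return m.group(1) + m.group(2)

print(course_and_flag("CSC 101 Intro to Computing A R"))  # CSC 101 A
```

Because .* is greedy, group 2 only ever receives the final character of the second-to-last word, which is fine when that word is a single letter as in the example.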
Do:
^(\S+)\s+(\S+).*(\S+)\s+\S+$
The 3 captured groups capture the 3 desired portions
\S indicates any non-whitespace character
\s indicates any whitespace character
Demo
As you have used grep with PCRE in your example, I am assuming you have access to the GNU toolset. Using GNU sed:
% sed -E 's/^(\S+)\s+(\S+).*(\S+)\s+\S+$/\1 \2 \3/' <<<"CSC 101 Intro to Computing A R"
CSC 101 A
A whole RegEx pattern can't match disjointed groups.
I suggest taking a look at Capture Groups - basically you capture the two disjointed groups, the matched couples of words can then be used by referring to these two groups.
grep can't print out multiple capture groups so an example with sed is
echo 'CSC 101 Intro to Computing A R' | sed -n 's/^\(\w\{3\}\s[[:digit:]]\{3\}\).*\?\(\w\)\s\+\w$/\1 \2/p' which prints out CSC 101 A
Note that the pattern used here is ^(\w{3}\s\d{3}).*?(\w)\s+\w$

regex matching two groups of repeating digits where both are not allowed to be the same digits

Folks,
I'm trying to use regular expressions to process a large set of number strings and match digit sequences for particular patterns where some digits are repeated in groups. Part of the requirement is to ensure uniqueness between sections of the given pattern.
An example of the kind of matching I'm trying to achieve
ABBBCCDD
Interpret this as a set of digits. But A,B,C,D cannot be the same. And the repetition of each is the pattern we're trying to match.
I've been using regular expressions with negative look-ahead as part of this matching and it works, but not all the time, and I'm confused as to why. I'm hoping someone can explain why it's glitching and suggest a solution.
So to address ABBBCCDD I came up with this RE using negative look-ahead using groups..
(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}
To break this down..
(.) single character wildcard group 1 (A)
(?!\1{1,7}) negative look-ahead for 1-7 occurrences of group 1 (A)
(.) single character wildcard group 2 (B)
\2{2} A further two occurrences of group 2 (B)
(?!\2{1,4}) Negative look-ahead of 1-4 occurrences of group 2 (B)
(.) single character wildcard group 3 (C)
\3{1} One more occurrence of group 3 (C)
(?!\3{1,2}) Negative look-ahead of 1-2 occurrences of group 3 (C)
(.) single character wildcard group 4 (D)
\4{1} one more occurrence of group 4 (D)
The thinking here is that the negative look-aheads act as a means of verifying that a given character is not found where it's unexpected. So A gets checked in the next 7 chars. Once B and its 2 repetitions are matched, we're negatively looking ahead for B in the next 4 chars. Finally, once the pair of Cs is matched, we're looking in the final 2 for a C as a means of detecting a mismatch.
For test data, this string "01110033" matches the expression. But it shouldn't because the '0' for A is repeated in the C position.
I ran checks of this expression in Python and with grep in PCRE mode (-P). Both matched the wrong pattern.
I put the expression in https://regex101.com/ along with the same test string "01110033" and it also matched there. I don't have enough rating to post images of this or of variations I tried with the test data. So here are some text grabs from command-line runs with grep -P
So our invalid expression that repeats A in CC position gets through..
$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110033
$
Changing DD to 11, copying BBB, we also find that gets through despite B having a forward negative check..
$ echo "01110011" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110011
$
Now change DD to "00", copying the CC digits, and lo and behold it doesn't match..
$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
Delete the forward-negative check for CC "(?!\3{1,2})" from the expression and our repeat of the C digit in the D position makes it through.
$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(.)\4{1}'
01110000
$
Back to the original test number and switch CC digits to the same use of '1' from B. It doesn't get through.
$ echo "01111133" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
And to play this out for the BBB group, set the B digits to the same 0 as encountered for A. Also fails to match..
$ echo "00002233" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
Then take out the negative lookahead for A and we can get this to match..
$ echo "00002233" | grep -P '(.)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
00002233
$
So it seems to me that the forward negative check is working but that it only works with the next adjacent set or its intended lookahead range is cut short in some form presumably by the extra things we're trying to match.
If I add an additional lookahead on A right after B and its repetition have been processed, we get it to avoid matching on the CC part reusing the A digit..
$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\1{1,4})(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
To take this further, then after matching the CC set, I would need to repeat the negative lookaheads for A and B again. This just seems wrong.
Hopefully an RE expert can clarify what I'm doing wrong here or confirm if negative-lookahead is indeed limited based on what I'm observing
(.)(?!.{0,6}\1)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}
   ^^^^^^^^^^^^
Change your lookahead to disallow a match when \1 appears anywhere in the string. See demo. You can similarly modify the other parts of your regex as well.
https://regex101.com/r/vV1wW6/31
NOTE: updated.
As vks already noted, your negative lookaheads weren't excluding what you thought -- \1{1,7} for example is only going to exclude A, AA, AAA, AAAA, AAAAA, AAAAAA, and AAAAAAA. I think you want the lookaheads to be .*\1, .*\2, .*\3, etc.
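The difference is easy to reproduce in Python. The sketch below compares the original expression with a variant whose lookaheads are widened to .*\1, .*\2 and .*\3 as suggested; the names `ORIGINAL`, `WIDENED` and `matches` are just for this example.

```python
import re

# Lookaheads that only reject an immediately following repeat of the group.
ORIGINAL = r"(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}"
# Lookaheads that reject the group's character anywhere later on.
WIDENED = r"(.)(?!.*\1)(.)\2{2}(?!.*\2)(.)\3{1}(?!.*\3)(.)\4{1}"

def matches(pattern, digits):
    return re.fullmatch(pattern, digits) is not None

for digits in ("01112233", "01110033", "01110011"):
    print(digits, matches(ORIGINAL, digits), matches(WIDENED, digits))
```

Only 01112233 should be accepted by both; the other two slip through the original expression because \1{1,7} never looks past the character directly after A.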
But here's another idea: It's easy to prefilter out ANY line that has non-adjacent repeated characters:
grep -P -v '(.)(?!\1).*\1'
And then your regexp on the result is MUCH simpler: .{1}.{3}.{2}.{2}
And in fact the whole thing can be combined using the first as a negative pre-lookahead constraint:
(?!.*(.)(?!\1).*\1).{1}.{3}.{2}.{2}
Or if you need to capture the digits as you did originally:
(?!.*(.)(?!\1).*\1)(.){1}(.){3}(.){2}(.){2}
But note that those digits will now be \2 \3 \4 \5, since \1 is in the lookahead.
Based on the feedback so far, I'm giving another answer that does not rely on doing arithmetic based on total length and that will self-containedly identify any sequence of 4 unique character/digit groups in the length sequence 1,3,2,2 anywhere in a string:
/(?<=^|(.)(?!\1))(.)\2{0}(?!\2)(.)\3{2}(?!\2|\3)(.)\4{1}(?!\2|\3|\4)(.)\5{1}(?!\5)/gm
 ^^^^^^^^^^^^^^^^ this is a look-behind that makes sure we're starting with a new character/digit
                 ^^^^^^^^ this is the size-1 group; yes the \2{0} is superfluous
                         ^^^^^^ this ensures the next group is unique
                               ^^^^^^^^ this is the size-3 group
 etc.
Let me know if this is closer to your solution. If so, and if all of your "patterns" consist of sequences of the group sizes you're looking for (like 1,3,2,2), I can come up with some code that will generate the corresponding regexp for any such input "pattern".
just some details here on what the eventual solution looked like for me..
So fundamentally (?!\1{1,7}) was not what I had thought it would be and was the entire cause of the issues I had encountered. Sincere appreciations to you guys for finding that issue for me.
The example I had shown was 1 from about 50 I had to formulate from a set of patterns.
It ended up as..
ABBBCCDD
09(.)(?!.{0,6}\1)(.)\2{2}(?!.{0,3}\2)(.)\3{1}(?!.{0,1}\3)(.)\4{1}
So once \1 (A) was captured, I tested negative lookahead of 0-6 wildchars preceding A. Then I capture \2 (B), its two repetitions and then give B negative lookahead of 0-3 wilds + B and so on.
It keeps the focus oriented around looking forward negatively to make sure the caught groups do not repeat where they are not supposed to. Then the subsequent captures and their recurrence patterns will do the rest in ensuring the match.
Other examples from the final set:
ABCCDDDD
(.)(?!.{0,6}\1)(.)(?!.{0,5}\2)(.)\3{1}(?!.{0,3}\3)(.)\4{3}
AABBCCDD
(.)\1{1}(?!.{0,5}\1)(.)\2{1}(?!.{0,3}\2)(.)\3{1}(?!.{0,1}\3)(.)\4{1}
ABCCDEDE
09(.)(?!.{0,6}\1)(.)(?!.{0,5}\2)(.)\3{1}(?!.{0,3}\3)(.)(?!\4{1})(.)\4{1}\5{1}
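With about 50 shapes to cover, generating the expressions may be less error-prone than writing them by hand. The following Python sketch is one possible generator and is not the method used above: instead of looking forward from each captured group, it places a negative lookahead in front of every new group that rules out all previously captured digits. The function name `build_regex` is hypothetical, and any fixed prefix such as the 09 in some of the expressions above would have to be prepended separately.

```python
import re

def build_regex(shape):
    """Turn a shape such as "ABBBCCDD" into a regex with distinct digit groups."""
    parts = []
    group_of = {}  # shape letter -> capture group number
    for letter in shape:
        if letter in group_of:
            # Repeated letter: must equal the digit captured earlier.
            parts.append("\\" + str(group_of[letter]))
        else:
            if group_of:
                # New letter: must differ from every digit captured so far.
                seen = "|".join("\\" + str(n) for n in group_of.values())
                parts.append("(?!" + seen + ")")
            parts.append("(.)")
            group_of[letter] = len(group_of) + 1
    return "".join(parts)

print(build_regex("ABBBCCDD"))
# (.)(?!\1)(.)\2\2(?!\1|\2)(.)\3(?!\1|\2|\3)(.)\4
```

Since each letter is handled on its own, interleaved shapes such as ABCCDEDE need no special treatment.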