Looking for regex to match before and after a number - regex

Given the string
170905-CBM-238.pdf
I'm trying to match 170905-CBM and .pdf so that I can replace/remove them and be left with 238.
I've searched and found pieces that work but can't put it all together.
This-> (.*-) will match the first section and
This-> (.[^/.]+$) will match the last section
But I can't figure out how to tie them together so that it matches everything before, including the second dash and everything after, including the period (or the extension) but does not match the numbers between.
help :) and thank you for your kind consideration.

There are several options to achieve what you need in Nintex.
If you use Extract operation, use (?<=^.*-)\d+(?=\.[^.]*$) as Pattern.
See the regex demo.
Details
(?<=^.*-) - a positive lookbehind requiring, immediately to the left of the current location, the start of string (^), then any 0+ chars other than LF as many as possible up to the last occurrence of - and the subsequent subpatterns
\d+ - 1 or more digits
(?=\.[^.]*$) - a positive lookahead requiring, immediately to the right of the current location, the presence of a . and 0+ chars other than . up to the end of the string.
If you use Replace text operation, use
Pattern: ^.*-([0-9]+)\.[^.]+$
Replacement text: $1
See another regex demo (the Context tab shows the result of the replacement).
Details
^ - a start of string anchor
.* - any 0+ chars other than LF up to the last occurrence of the subsequent subpatterns...
- - a hyphen
([0-9]+) - Group 1: one or more ASCII digits
\. - a literal .
[^.]+ - 1 or more chars other than .
$ - end of string.
The replacement $1 references the value stored in Group 1.
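For anyone testing outside Nintex, here is a minimal Python sketch of the Replace text pattern above. Python's built-in re module only accepts fixed-width lookbehinds, so the Extract pattern with (?<=^.*-) cannot be used there as written; the sketch relies on the capture-group pattern instead. The name extract_number is hypothetical.

```python
import re

# Same pattern as the Replace text operation: only Group 1 is kept.
PATTERN = re.compile(r"^.*-([0-9]+)\.[^.]+$")

def extract_number(filename):
    """Return the digits between the last hyphen and the extension, or None."""
    match = PATTERN.match(filename)
    return match.group(1) if match else None

print(extract_number("170905-CBM-238.pdf"))      # 238
print(PATTERN.sub(r"\1", "170905-CBM-238.pdf"))  # 238 (replace form)
```

Note that Python writes the backreference as \1 in the replacement, where Nintex uses $1.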

I don't know Nintex regex, but here is a sed-style regex:
$ echo "170905-CBM-238.pdf" | sed -E 's/^.*-([0-9]*)\.[^.]*$/\1/'
238
Same works in Perl:
$ echo "170905-CBM-238.pdf" | perl -pe 's/^.*-([0-9]*)\.[^.]*$/$1/'
238

Related

Regex (PCRE): Match all digits in a line following a line which includes a certain string

Using PCRE, I want to capture only and all digits in a line which follows a line in which a certain string appears. Say the string is "STRING99". Example:
car string99 house 45b
22 dog 1 cat
women 6 man
In this case, the desired result is:
221
I asked a similar question some time ago; however, back then I was trying to capture the numbers in the SAME line where the string appears (Regex (PCRE): Match all digits conditional upon presence of a string). While the question is similar, I don't think the answer, if there is one at all, will be similar. The approach using the line anchor ^ does not work in this case.
I am looking for a single regular expression without any other programming code. It would be easy to accomplish with two consecutive regex operations, but this is not what I'm looking for.
Maybe you could try:
(?:\bstring99\b.*?\n|\G(?!^))[^\d\n]*\K\d
See the online demo
(?: - Open non-capture group:
\bstring99\b - Literally match "string99" between word-boundaries.
.*?\n - Lazy match up to (including) nearest newline character.
| - Or:
\G(?!^) - Asserts the position at the end of the previous match; the negative lookahead (?!^) prevents \G from succeeding at the start of the string, where it would otherwise match before any previous match exists.
) - Close non-capture group.
[^\d\n]* - Match 0+ non-digit/newline characters.
\K - Resets the starting point of the reported match.
\d - Match a digit.
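To check what the pattern above is expected to return, here is a small Python sketch. Python's built-in re module has neither \G nor \K, so this is a two-step emulation for verifying results, not a replacement for the single PCRE pattern. The function name digits_after_marker is hypothetical.

```python
import re

def digits_after_marker(text, marker="string99"):
    """Collect every digit on each line that directly follows a marker line."""
    marker_re = re.compile(rf"\b{re.escape(marker)}\b")
    lines = text.split("\n")
    digits = []
    # Pair each line with the one after it.
    for previous, current in zip(lines, lines[1:]):
        if marker_re.search(previous):
            digits.extend(re.findall(r"\d", current))
    return "".join(digits)

sample = "car string99 house 45b\n22 dog 1 cat\nwomen 6 man"
print(digits_after_marker(sample))  # 221
```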

Regex to check number of spaces after full stop - Strictly 2 required

I need to find occurrences where I have put one whitespace after a full stop and replace it with 2 spaces. I have the regex for it, but Atom seems to call it invalid.
(?<=\.|\") {1,}(?=[a-zA-Z])
Conditions:
1 space after a period.
If the period is followed by a closing double quote, then 1 space after the quote.
The above regex works perfectly for my conditions however Atom is not able to validate it. I need to use it for existing files.
You may use
([."]) ([a-zA-Z])
and replace with $1  $2 (two spaces between $1 and $2). See the regex demo.
Details
([."]) - Group 1 (its value is referred to with $1 backreference from the replacement pattern): . or "
(space) - a single literal space (use \s to match any whitespace)
([a-zA-Z]) - Group 2 ($2): an ASCII letter.
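A minimal Python sketch of the same replacement, assuming plain-text input; the function name two_spaces_after_stop is hypothetical. Python writes the backreferences as \1 and \2 instead of $1 and $2.

```python
import re

def two_spaces_after_stop(text):
    # Group 1 is the period or quote, group 2 the following letter;
    # the replacement puts exactly two spaces between them.
    return re.sub(r'([."]) ([a-zA-Z])', r"\1  \2", text)

result = two_spaces_after_stop('One. Two." Three.  Four')
print(result)  # One.  Two."  Three.  Four
```

The pattern only matches exactly one space, so text that already has two spaces (as in "Three.  Four") is left untouched.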

last year occurrence from string

I have strings like this:
ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar
I'm trying to get the last occurrence of a single year (from 1900 to 2050), so I need to extract only 1934 from that string.
I'm trying with:
grep -P -o '\s(19|20)[0-9]{2}\s(?!\s(19|20)[0-9]{2}\s)'
or
grep -P -o '((19|20)[0-9]{2})(?!\s\1\s)'
But it matches: 1910 and 1934
Here's the Regex101 example:
https://regex101.com/r/UetMl0/3
https://regex101.com/r/UetMl0/4
Plus: how can I extract the year without the surrounding spaces without doing an extra grep to filter them?
Have you ever heard this saying:
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
Keep it simple - you're interested in finding a number between 2 numbers so just use a numeric comparison, not a regexp:
$ awk -v min=1900 -v max=2050 '{yr=""; for (i=1;i<=NF;i++) if ( ($i ~ /^[0-9]{4}$/) && ($i >= min) && ($i <= max) ) yr=$i; print yr}' file
1934
You didn't say what to do if no date within your range is present so the above outputs a blank line if that happens but is easily tweaked to do anything else.
To change the above script to find the first instead of the last date is trivial (move the print inside the if), to use different start or end dates in your range is trivial (change the min and/or max values), etc., etc. which is a strong indication that this is the right approach. Try changing any of those requirements with a regexp-based solution.
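The same numeric-comparison idea can be sketched in Python, assuming whitespace-separated fields as in the awk script. The function name last_year and its parameters low and high are hypothetical.

```python
import re

def last_year(line, low=1900, high=2050):
    """Return the last standalone 4-digit field within [low, high], or ""."""
    found = ""
    for field in line.split():
        # Only whole fields of exactly four digits qualify.
        if re.fullmatch(r"[0-9]{4}", field) and low <= int(field) <= high:
            found = field
    return found

line = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year(line))  # 1934
```

As with the awk version, returning the first year instead only requires returning inside the if.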
I don't see a way to do this with grep because it doesn't let you output just one of the capture groups, only the whole match.
With perl I'd do something like
perl -lne 'if (/^.*\b(19\d\d|20(?:[0-4]\d|50))\b/) { print $1 }'
Idea: Use ^.* (greedy) to consume as much of the string up front as possible, thus finding the last possible match. Use \b (word boundary) around the matched number to prevent matching 01900 or X1911D. Only print the first capture group ($1).
I tried to implement your requirement of 1900-2050; if that's too complicated, ((?:19|20)\d\d) will do (but also match e.g. 2099).
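The greedy-prefix idea carries over to other engines. Here is a hedged Python sketch of it; the names YEAR_AFTER_GREEDY_PREFIX and last_year_greedy are hypothetical.

```python
import re

# Greedy ^.* consumes as much as possible, so Group 1 is the last year.
YEAR_AFTER_GREEDY_PREFIX = re.compile(r"^.*\b(19\d\d|20(?:[0-4]\d|50))\b")

def last_year_greedy(line):
    match = YEAR_AFTER_GREEDY_PREFIX.search(line)
    return match.group(1) if match else None

line = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year_greedy(line))  # 1934
```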
The regex to do your task using grep can be as follows:
\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)
Details:
\b - Word boundary.
(?: - Start of a non-capturing group, needed as a container for alternatives.
19\d{2}| - The first alternative (1900 - 1999).
20[0-4]\d| - The second alternative (2000 - 2049).
2050 - The third alternative, just 2050.
) - End of the non-capturing group.
\b - Word boundary.
(?! - Negative lookahead for:
.* - A sequence of any chars, meaning actually "what follows can occur anywhere further".
\b(?:19\d{2}|20[0-4]\d|2050)\b - The same expression as before.
) - End of the negative lookahead.
The word boundary anchors ensure that you will not match numbers that are parts of longer words, e.g. X1911D.
The negative lookahead ensures that you will match just the last occurrence of the required year.
If you can use a tool other than grep that supports calling a previously defined numbered group with (?n), where n is the number of the capturing group, the regex can be a bit simpler:
(\b(?:19\d{2}|20[0-4]\d|2050)\b)(?!.*(?1))
Details:
(\b(?:19\d{2}|20[0-4]\d|2050)\b) - The regex like before, but enclosed within a capturing group (it will be "called" later).
(?!.*(?1)) - Negative lookahead for capturing group No 1, located anywhere further.
This way you avoid writing the same expression again.
For a working example in regex101 see https://regex101.com/r/fvVnZl/1
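Python's built-in re module does not support the (?1) group call, but the repetition can still be avoided by building the pattern from a variable. This is a sketch under that assumption; the names YEAR, LAST_YEAR and find_last_year are hypothetical.

```python
import re

# The year sub-pattern is written once and reused inside the lookahead.
YEAR = r"\b(?:19\d{2}|20[0-4]\d|2050)\b"
LAST_YEAR = re.compile(rf"{YEAR}(?!.*{YEAR})")

def find_last_year(line):
    match = LAST_YEAR.search(line)
    return match.group(0) if match else None

line = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(find_last_year(line))  # 1934
```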
You may use a PCRE regex without any groups to only return the last occurrence of a pattern you need if you prepend the pattern with ^.*\K, or, in your case, since you expect a whitespace boundary, ^(?:.*\s)?\K:
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' file
See the regex demo.
Details
^ - start of line
(?:.*\s)? - an optional non-capturing group matching 1 or 0 occurrences of
.* - any 0+ chars other than line break chars, as many as possible
\s - a whitespace char
\K - match reset operator discarding the text matched so far
(?:19\d{2}|20(?:[0-4]\d|50)) - 19 and any two digits or 20 followed with either a digit from 0 to 4 and then any digit (00 to 49) or 50.
(?!\S) - a negative lookahead that requires a whitespace char or the end of string immediately to the right.
See an online demo:
s="ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' <<< "$s"
# => 1934
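Whitespace boundaries and \b word boundaries can give different answers when a year is glued to punctuation. The Python sketch below illustrates this; since Python's re has no \K, a capture group stands in for it. The pattern names and the second sample line are made up for illustration.

```python
import re

# Whitespace-bounded: year must be preceded by start/whitespace
# and followed by whitespace/end of string.
WS_BOUNDED = re.compile(r"^(?:.*\s)?(19\d{2}|20(?:[0-4]\d|50))(?!\S)")
# Word-bounded: a hyphen also counts as a boundary.
WORD_BOUNDED = re.compile(r"^.*\b(19\d{2}|20(?:[0-4]\d|50))\b")

line = "ACB 1910 1955-2011 foo"
ws_result = WS_BOUNDED.search(line).group(1)
word_result = WORD_BOUNDED.search(line).group(1)
print(ws_result)    # 1910
print(word_result)  # 2011
```

Here 2011 is rejected by the whitespace version because it is attached to "1955-", while the \b version accepts it.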

RegExp checking for sign only if there is text afterwards

I have some cases, which I need to filter with a regex. The values which need to be filtered are listed below:
// These should be catched
123456_Test.pdf
123456 Test.pdf
123456.pdf
// These shouldn't be catched
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
The current regEx looks like this:
(\d{6,7})((\_| ){0,1})(.*)\..*
The problem here is that the latter 3 are also matched. To give you a short overview of what's wrong with the "wrongly" matched strings:
The 1st capture group has to consist of 6-7 digits (this capture group is also needed at the end). If there are letters after these digits, there has to be a whitespace or underscore between them. The first of the entries that shouldn't match shows this: it is invalid because letters follow 123456 without the required separator.
The last entry isn't really important, it is just there for convenience.
What am I missing? How do I adjust my regex in a way, that I can check for signs, only if there are letters following a number-chain?
You may use
^(\d{6,7})([_ ][A-Za-z].*)?\..*$
See the regex demo
Details
^ - start of a string
(\d{6,7}) - Group 1: 6 or 7 digits
([_ ][A-Za-z].*)? - an optional capturing group #2: a _ or space followed with a letter and then any 0+ chars as many as possible, up to the last
\. - . on a line
.* - the rest of the line
$ - end of string.
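The pattern can be checked against all six sample names with a short Python sketch; the function name leading_digits is hypothetical.

```python
import re

PATTERN = re.compile(r"^(\d{6,7})([_ ][A-Za-z].*)?\..*$")

def leading_digits(name):
    """Return Group 1 (the 6-7 digits) if the name is valid, else None."""
    match = PATTERN.match(name)
    return match.group(1) if match else None

names = [
    "123456_Test.pdf", "123456 Test.pdf", "123456.pdf",       # should match
    "123456Abcasd.pdf", "123456-Abcasd.pdf", "123456_.pdf",   # should not
]
for name in names:
    print(f"{name!r}: {leading_digits(name)}")
```

The first three names yield 123456 and the last three yield None.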
Check if this perl solution works for you.
> cat regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
> perl -ne ' print if m/\d+(([ _])[a-zA-Z]+| [a-zA-Z]*)?\.pdf/ ' regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
>

Get the first occurrence of a string in a variable REGEX

I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I'm trying the regex .*(\-I).*(\-V), this doesn't work. Then I tried .*(\-I), but it gets the last -IPRMKT string.
So my question is: is there a way to split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT at the first occurrence of -I?
This should do the trick:
regex = r"(.*?-I\d{2})-(.*)"
Here is a test script in Python:
import re

# Lazy .*? stops at the first "-I" that is followed by two digits.
regex = r"(.*?-I\d{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
    print("yep")
    print(match.group(1))  # PSC-CAMPO-GRANDE-I08
    print(match.group(2))  # V00-C09-H09-IPRMKT
else:
    print("nope")
In the regex, I'm grabbing everything up to the first -I then 2 numbers. Then match but don't capture a -. Then capture the rest. I can help tweak it if you have more logic that you are trying to do.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ chars other than line break chars up to the first (because *? is a lazy quantifier that matches up to the first occurrence)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.
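The difference between the greedy .* from the question and the lazy .*? can be seen in a short Python sketch. The variable names are hypothetical; the sample string is the one from the question.

```python
import re

s = "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT"

# Greedy .* backtracks from the end, so (-I) lands on the LAST "-I".
greedy = re.match(r".*(-I)", s)
# Lazy .*? expands from the start, so (-I) lands on the FIRST "-I".
lazy = re.match(r".*?(-I)", s)
print(s[greedy.start(1):])  # -IPRMKT
print(s[lazy.start(1):])    # -I08-V00-C09-H09-IPRMKT

first, second = re.match(r"^(.*?-I[^-]*)-(.*)", s).groups()
print(first)   # PSC-CAMPO-GRANDE-I08
print(second)  # V00-C09-H09-IPRMKT
```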