Regex to capture page number from filename - regex

I have document page images named (for example) as follows:
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 02 [Declaration 2].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png”
I want to capture ONLY the page numbers, without preceding zeros (1, 2, 7, 12 in this example). Based on code I saw here, I thought maybe something like this might take care of it:
- 0*\d+.*\.(?:jpe?g|png|tiff?)$(?!(?:0*)\d+)
…but, it did not. Any other suggestions?

You could use a capturing group for the digits:
- 0*(\d+) \[[^][]*]\.(?:jpe?g|png|tiff?)\b
Explanation
- 0* Match a hyphen, a space, and then zero or more zeros
(\d+) Capture group 1, match 1+ digits
 \[[^][]*] Match a space, then everything from [ through ]
\.(?:jpe?g|png|tiff?)\b Match a dot and one of the alternatives
Regex demo
To capture the last digits without leading zeroes after the last occurrence of space dash space, you could use a negative lookahead:
- 0*(\d+)(?!.* - ).*\.(?:jpe?g|png|tiff?)$
Regex demo
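As a rough illustration, the lookahead pattern above can be tried in Python against the four sample names from the question. The helper name page_number and the conversion to int are assumptions made for this sketch, not part of the original answer.

```python
import re

# Pattern from the answer: digits after the last " - ", with leading zeros
# consumed outside the capture group.
PAGE_RE = re.compile(r"- 0*(\d+)(?!.* - ).*\.(?:jpe?g|png|tiff?)$")


def page_number(filename):
    """Return the page number as an int, or None when the name does not match.

    Hypothetical helper, named only for this example.
    """
    match = PAGE_RE.search(filename)
    return int(match.group(1)) if match else None


names = [
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 02 [Declaration 2].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png",
]

pages = [page_number(name) for name in names]
print(pages)  # [1, 2, 7, 12]
```

The earlier " - 12345" segment is rejected because another " - " still follows it, which is exactly what the negative lookahead checks.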

So it looks like you want to end up at the last hyphen. Try:
-\h*(?!.*-)0*(\d+)
See the demo
-\h* - Match a literal hyphen and zero or more horizontal whitespaces.
(?!.*-) - A negative lookahead for zero or more characters and hyphen.
0* - Zero or more zeroes.
(\d+) - Capture at least a single digit into capture group 1.
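Note that \h is a PCRE-style escape; Python's re module does not accept it. A minimal sketch of the same idea in Python, assuming [ \t] is an acceptable stand-in for "horizontal whitespace" (that substitution is mine, not the answer's):

```python
import re

# Same structure as the pattern above, with \h swapped for [ \t] because
# Python's re module has no \h escape (assumption: space/tab is enough here).
LAST_HYPHEN_RE = re.compile(r"-[ \t]*(?!.*-)0*(\d+)")

names = [
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 02 [Declaration 2].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png",
]

# Group 1 holds the digits after the last hyphen, without leading zeros.
page_strings = [LAST_HYPHEN_RE.search(name).group(1) for name in names]
print(page_strings)  # ['1', '2', '7', '12']
```

Because the lookahead forbids any later hyphen, this version would stop matching if the bracketed description itself contained a hyphen.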
End note: Please give credit where credit is due. Your question did not have the necessary details given later through comments. This answer is far more detailed based on what you provided in the OP.

Related

How to build a regexp to match optional patterns

I have the following strings sample:
MAREMMA TOSCANA BIANCO DOC 2020 CALASOLE MONTEMASSI0,750
CHIANTI CLASSICO DOCG 2012 RISERVA ALBOLA LT.0,750
I need to separate them into 5 parts (where I put the | in the following samples):
MAREMMA TOSCANA BIANCO DOC |2020| CALASOLE MONTEMASSI|0,750
CHIANTI CLASSICO DOCG |2012| RISERVA ALBOLA |LT.|0,750
As you can see, the fourth part is optional.
I tried some variations of this regexp on https://regex101.com/r/NX3DE3/1, but the LT. part is absorbed into the preceding one:
([A-Za-z ]+)((20\d\d)|(19\d\d))([A-Za-z ]*)((LT))\.?[0-9,]*
The ((LT)) group is optional, but if I add a ? it works on the first example and not on the second, and vice versa.
I would also like to trim the different parts, but really don't know how!
You can use
^(.*?)\s*((?:20|19)\d\d)\s*(.*?)(?:\s+(LT)[. ])?(\d[\d,]*)
See the regex demo. Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\s* - zero or more whitespaces
((?:20|19)\d\d) - Group 2: 20 or 19 and then two digits
\s* - zero or more whitespaces
(.*?) - Group 3: any zero or more chars other than line break chars as few as possible
(?:\s+(LT)[. ])? - an optional non-capturing group matching one or more whitespaces and then capturing into Group 4 LT and then a space or .
(\d[\d,]*) - Group 5: a digit and then zero or more digits or commas.
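A short Python sketch of how this pattern splits the two sample strings. The variable names are illustrative only; the fourth group comes back as None when LT is absent.

```python
import re

# Pattern from the answer, unchanged.
WINE_RE = re.compile(
    r"^(.*?)\s*((?:20|19)\d\d)\s*(.*?)(?:\s+(LT)[. ])?(\d[\d,]*)"
)

lines = [
    "MAREMMA TOSCANA BIANCO DOC 2020 CALASOLE MONTEMASSI0,750",
    "CHIANTI CLASSICO DOCG 2012 RISERVA ALBOLA LT.0,750",
]

# Each row is a 5-tuple: name, year, producer, optional LT marker, size.
rows = [WINE_RE.match(line).groups() for line in lines]
for row in rows:
    print(row)
# ('MAREMMA TOSCANA BIANCO DOC', '2020', 'CALASOLE MONTEMASSI', None, '0,750')
# ('CHIANTI CLASSICO DOCG', '2012', 'RISERVA ALBOLA', 'LT', '0,750')
```

The \s* and \s+ parts sit outside the capture groups, which is why the captured parts arrive already trimmed.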

include searched regex text also in output

I'm using the regex re.findall(r"[0-9]+(.*?)\.\s(.*?)[0-9]+", text) to extract the text below:
8 EXT./INT. MONORAIL - MORNING 8
9 EXT. CITY SCAPE/MONORAIL - CONTINUOUS 9
But my current output doesn't have the prefix and suffix numbers. I'm trying to have the prefix digits also in the output as follows.
9 EXT. CITY SCAPE/MONORAIL - CONTINUOUS
Any help greatly appreciated! Thanks in advance.
(The current output is given below)
You can use
(?m)^([0-9]+)\s*(.*?)\.\s(.*?)(?:\s*([0-9]+))?$
See the regex demo. Details:
(?m) - a multiline modifier
^ - start of a line (because of the multiline modifier)
([0-9]+) - Group 1: one or more digits
\s* - zero or more whitespaces
(.*?) - Group 2: zero or more chars other than line break chars as few as possible
\.\s - a dot and a whitespace
(.*?) - Group 3: zero or more chars other than line break chars as few as possible
(?:\s*([0-9]+))? - an optional occurrence of zero or more whitespaces and then Group 4 capturing one or more digits
$ - end of line.
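Since the question already uses re.findall, here is a minimal Python sketch with the pattern above. The sample text is the two lines from the question; how the groups are joined back together is an assumption about the desired output format.

```python
import re

text = (
    "8 EXT./INT. MONORAIL - MORNING 8\n"
    "9 EXT. CITY SCAPE/MONORAIL - CONTINUOUS 9"
)

# Pattern from the answer; (?m) makes ^ and $ work per line.
SCENE_RE = re.compile(r"(?m)^([0-9]+)\s*(.*?)\.\s(.*?)(?:\s*([0-9]+))?$")

# findall returns one tuple per line: (prefix number, part before ". ",
# part after ". ", suffix number).
scenes = SCENE_RE.findall(text)
print(scenes)
# [('8', 'EXT./INT', 'MONORAIL - MORNING', '8'),
#  ('9', 'EXT', 'CITY SCAPE/MONORAIL - CONTINUOUS', '9')]

# Rebuild each heading with the prefix number but without the suffix number.
headings = [f"{num} {kind}. {place}" for num, kind, place, _ in scenes]
print(headings)
# ['8 EXT./INT. MONORAIL - MORNING', '9 EXT. CITY SCAPE/MONORAIL - CONTINUOUS']
```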

Regex to block more than 3 numbers in a string

I am trying to block any strings that contain more than 3 numbers and prevent special characters. I have the special characters part down. I'm just missing the number part.
For example:
"Hello 1234" - Not Allowed
"Hello 123" - Allowed
I've tried the following:
/^[!?., A-Za-z0-9]+$/
/((^[!?., A-Za-z]\d)([0-9]{3}+$))/
/^((\d){2}[a-zA-Z0-9,.!? ])*$/
The last one is the closest I got as it prevents any special characters and any numbers from being entered at all.
I've looked through previous posts, but am coming up short.
Edit for clarification
Essentially I'm trying to find a way to prevent customers from entering PII on a form. No submission should be allowed that contains more than 3 numbers in a string.
Hello1234 - Not allowed
12345 - Not allowed
1111 - not allowed
Nowhere in the comment section should the string the user enters contain more than 3 numbers in total.
About the patterns that you tried
^[!?., A-Za-z0-9]+$ The pattern matches 1+ times any of the listed, including 1 or more digits
((^[!?., A-Za-z]\d)([0-9]{3}+$)) If {3}+ is supported, the pattern matches a single char from the character class, 1 digit followed by 3 digits
^((\d){2}[a-zA-Z0-9,.!? ])*$ The pattern repeats 0+ times matching 2 digits and 1 of the listed in the character class
You can use a negative lookahead if that is supported to assert not 4 digits in a row.
^(?!.*\d{4})[a-zA-Z0-9,.!? ]+$
regex demo
If there can not be 4 digits in total, but 0-3 occurrences:
^[a-zA-Z,.!? ]*(?:\d[a-zA-Z,.!? ]*){0,3}$
Explanation
^ Start of string
[a-zA-Z,.!? ]* Match 0+ times any of the listed (without a digit)
(?:\d[a-zA-Z,.!? ]*){0,3} Repeat 0 - 3 times matching a single digit followed by optional listed chars (Again without a digit)
$ End of string
regex demo
If you don't want to match an empty string and a lookahead is supported:
^(?!$)[a-zA-Z,.!? ]*(?:\d[a-zA-Z,.!? ]*){0,3}$
See another regex demo
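The difference between "no 4 digits in a row" and "no more than 3 digits in total" is easy to miss, so here is a small Python comparison of the first and last patterns from this answer. The sample inputs beyond those in the question (such as "1a2b3c4d") are made up for illustration.

```python
import re

# Rejects only a run of 4 consecutive digits.
NO_FOUR_IN_A_ROW = re.compile(r"^(?!.*\d{4})[a-zA-Z0-9,.!? ]+$")

# Allows at most 3 digits anywhere in the string and rejects the empty string.
AT_MOST_THREE = re.compile(r"^(?!$)[a-zA-Z,.!? ]*(?:\d[a-zA-Z,.!? ]*){0,3}$")

samples = ["Hello 123", "Hello 1234", "1a2b3c4d", "12345", ""]

# Map each sample to (passes consecutive check, passes total check).
results = {
    s: (bool(NO_FOUR_IN_A_ROW.search(s)), bool(AT_MOST_THREE.search(s)))
    for s in samples
}
for sample, outcome in results.items():
    print(repr(sample), outcome)
# 'Hello 123' (True, True)
# 'Hello 1234' (False, False)
# '1a2b3c4d' (True, False)
# '12345' (False, False)
# '' (False, False)
```

Only the second pattern rejects "1a2b3c4d", which has four digits that are never adjacent.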
Here is my two cents:
^(?!(.*\d){4})[A-Za-z ,.!?\d]+$
See the online demo
^ - Start string anchor.
(?! - Open a negative lookahead.
( - Open capture group.
.*\d - Match anything other than newline up to a digit.
){4} - Close capture group and match it 4 times.
) - Close negative lookahead.
[A-Za-z ,.!?\d]+ - 1+ Characters from specified class.
$ - End string anchor.
I think it should cover what you described.
Assuming you mean <= 3 digits, this may be a naive one but how about
[ALLOWED_CHARS]*[0-9]?[ALLOWED_CHARS]*[0-9]?[ALLOWED_CHARS]*[0-9]?[ALLOWED_CHARS]*
Replace [ALLOWED_CHARS] with a character class of whatever you define as non-special, non-digit characters.

Regex match depending on lookbehind match

I need to match these values:
(First approach to a regex that roughly does what I want)
\d+([.,]\d{3})*[.,]\d{2}
like
24,56
24.56
1.234,56
1,234.56
1234,56
1234.56
but I need to not match
1.234.56
1,234,56
So somehow I need to check the last occurrence of "." or "," to not be the same as the previous "." or ",".
Background: Amounts shall be matched in English and German format with (optional) 1000-Separators.
But even with help of regex101 I completely fail at coming up with a correctly working look-behind. Any suggestions are highly appreciated.
UPDATE
Based on the answers I got so far, I came up with this (demo):
\d{1,3}(?:([\.,'])?\d{3})*(?!\1)[\.,\s]\d{2}
But it matches for example 1234.567,23 which is not desirable.
You may capture the digit grouping symbol and use a negative lookahead with a backreference to restrict the decimal separator:
^(?:\d+|\d{1,3}(?:([.,])\d{3})*)(?!\1)[.,]\d{2}$
See the regex demo
Group 1 will contain the last value of the digit grouping symbol and (?!\1)[.,] will match the other symbol.
Details:
^ - start of string
(?:\d+|\d{1,3}(?:([.,])\d{3})*) - either of the two alternatives:
\d+ - 1+ digits
| - or
\d{1,3} - 1 to 3 digits,
(?:([.,])\d{3})* - zero or more sequences of:
([.,]) - Group 1 capturing . or ,
\d{3} - 3 digits
(?!\1)[.,] - a . or , but not equal to what was last captured with ([.,]) pattern above
\d{2} - 2 digits
$ - end of string.
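A quick Python check of this pattern against the values listed in the question. It relies on the fact that, in Python's re module, a backreference to a group that did not participate in the match fails, so (?!\1) succeeds when no grouping symbol was captured. The names used are illustrative.

```python
import re

# Pattern from the answer, unchanged.
AMOUNT_RE = re.compile(r"^(?:\d+|\d{1,3}(?:([.,])\d{3})*)(?!\1)[.,]\d{2}$")

should_match = ["24,56", "24.56", "1.234,56", "1,234.56", "1234,56", "1234.56"]
should_fail = ["1.234.56", "1,234,56", "1234.567,23"]

matched = [s for s in should_match + should_fail if AMOUNT_RE.search(s)]
print(matched)
# ['24,56', '24.56', '1.234,56', '1,234.56', '1234,56', '1234.56']
```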
You can use
^\d+(([.,])\d{3})*(?!\2)[.,]\d{2}$
live demo

Live input validation of 4-digit prefixes with optional 8 digits at the end

This is my regex:
/^08(17|18|19|31|32|33|38|59|77|78)[0-9]{0,8}$/
If I type 08 into the input field, an error notice is shown. What I want is for 0817 to show success, and for 08 on its own not to trigger the error notice. Maybe the solution is to use a non-capturing group in the regex, but how do I do that?
These are the prefixes I want to validate:
0817, 0818, 0819, 0831, 0832, 0833, 0838, 0859, 0877, 0878
You want to implement a live input validation for your codes that consist of 4-digit set prefixes and then 0 to 8 arbitrary digits.
The point is that you cannot simply make the subpatterns optional one after another; you need nested optional groups, so that a left-hand digit must be present before the right-hand one is allowed.
The pattern becomes rather untidy, but that is the only way to make it work:
^0(?:8(?:1(?:[789][0-9]{0,8})?|3(?:[1238][0-9]{0,8})?|5(?:9[0-9]{0,8})?|7(?:[78][0-9]{0,8})?)?)?$
See the regex demo
Details:
^ - start of string
0 - an obligatory 0
(?:
8 - obligatory 8
(?:
1 - obligatory 1 followed by...
(?: - an optional group matching either...
[789] - 7, or 8, or 9 followed with
[0-9]{0,8} - 0 to 8 any digits
)? - (end of the optional group after 1)
| - or
3(?:[1238][0-9]{0,8})? - (similar to above)
| - or
5(?:9[0-9]{0,8})? - (similar to above)
| - or
7(?:[78][0-9]{0,8})? - (similar to above)
)? - end of the optional group matching the 8 and all after it
)? - the whole part after the first 0 is optional.
$ - end of string.
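To see the "valid so far" behaviour, the sketch below feeds the pattern every prefix of a number as if it were typed one character at a time. The function name is_valid_so_far and the sample number are assumptions for illustration.

```python
import re

# Nested-optional pattern from the answer, unchanged.
LIVE_RE = re.compile(
    r"^0(?:8(?:1(?:[789][0-9]{0,8})?|3(?:[1238][0-9]{0,8})?"
    r"|5(?:9[0-9]{0,8})?|7(?:[78][0-9]{0,8})?)?)?$"
)


def is_valid_so_far(text):
    """Return True while the typed text can still become a valid number."""
    return LIVE_RE.match(text) is not None


typed = "081712345678"  # hypothetical input: prefix 0817 plus 8 digits
prefixes = [typed[:i] for i in range(1, len(typed) + 1)]

# Every intermediate state ("0", "08", "081", "0817", ...) is accepted.
all_prefixes_ok = all(is_valid_so_far(p) for p in prefixes)
print(all_prefixes_ok)  # True

# Inputs that can no longer lead to an allowed prefix are rejected at once.
rejected = [s for s in ["09", "0816", "0817123456789"] if not is_valid_so_far(s)]
print(rejected)  # ['09', '0816', '0817123456789']
```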
You don't need a non-capturing group. You only need ? to make the group optional, so that it matches 0 or 1 times.
/^08(17|18|19|31|32|33|38|59|77|78)?[0-9]{0,8}$/
Proof: https://regex101.com/r/xC7mT4/1
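A small Python sketch of how this simpler pattern behaves, under the assumption that it is used for the same live check. One thing to be aware of: with the prefix group optional, the trailing [0-9]{0,8} can absorb digits that are not an allowed prefix, so an input such as 0816 is also accepted.

```python
import re

# Pattern from this answer, without the surrounding / delimiters.
SIMPLE_RE = re.compile(r"^08(17|18|19|31|32|33|38|59|77|78)?[0-9]{0,8}$")

samples = ["0", "08", "081", "0817", "0816"]

# Map each sample to whether the pattern accepts it.
accepted = {s: SIMPLE_RE.match(s) is not None for s in samples}
for sample, ok in accepted.items():
    print(sample, ok)
# 0 False
# 08 True
# 081 True
# 0817 True
# 0816 True
```

Whether accepting 0816 matters depends on whether a stricter check runs when the form is submitted.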