include searched regex text also in output - regex

I'm using the regex re.findall(r"[0-9]+(.*?)\.\s(.*?)[0-9]+", text) to extract the text below:
8 EXT./INT. MONORAIL - MORNING 8
9 EXT. CITY SCAPE/MONORAIL - CONTINUOUS 9
But my current output doesn't have the prefix and suffix numbers. I'm trying to have the prefix digits also in the output as follows.
9 EXT. CITY SCAPE/MONORAIL - CONTINUOUS
Any help greatly appreciated! Thanks in advance.
(The current output is given below)

You can use
(?m)^([0-9]+)\s*(.*?)\.\s(.*?)(?:\s*([0-9]+))?$
See the regex demo. Details:
(?m) - a multiline modifier
^ - start of a line (because of the (?m) modifier)
([0-9]+) - Group 1: one or more digits
\s* - zero or more whitespaces
(.*?) - Group 2: zero or more chars other than line break chars as few as possible
\.\s - a dot and a whitespace
(.*?) - Group 3: zero or more chars other than line break chars as few as possible
(?:\s*([0-9]+))? - an optional occurrence of zero or more whitespaces and then Group 4 capturing one or more digits
$ - end of line.
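
As a minimal sketch of how this pattern could be applied in Python (the language used in the question), the snippet below rebuilds each heading from Groups 1 to 3 and drops the trailing number in Group 4. The variable names and the way the pieces are joined are illustrative assumptions, not part of the original answer.

```python
import re

# Sample scene headings from the question, as one multi-line string.
text = (
    "8 EXT./INT. MONORAIL - MORNING 8\n"
    "9 EXT. CITY SCAPE/MONORAIL - CONTINUOUS 9"
)

# The pattern from the answer; (?m) makes ^ and $ match at line boundaries.
pattern = re.compile(r"(?m)^([0-9]+)\s*(.*?)\.\s(.*?)(?:\s*([0-9]+))?$")

headings = []
for m in pattern.finditer(text):
    number, kind, rest, trailing = m.groups()
    # Re-insert the ". " that the pattern consumes between Groups 2 and 3.
    headings.append(f"{number} {kind}. {rest}")

print(headings)
```

With re.findall the same pattern returns one 4-tuple per line, so the prefix digits are available as the first element of each tuple.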

Related

How to build a regexp to match optional patterns

I have the following strings sample:
MAREMMA TOSCANA BIANCO DOC 2020 CALASOLE MONTEMASSI0,750
CHIANTI CLASSICO DOCG 2012 RISERVA ALBOLA LT.0,750
I need to separate them into 5 parts (where I put the | in the following samples):
MAREMMA TOSCANA BIANCO DOC |2020| CALASOLE MONTEMASSI|0,750
CHIANTI CLASSICO DOCG |2012| RISERVA ALBOLA |LT.|0,750
As you can see, the fourth part is optional.
I tried some variations of this regexp on https://regex101.com/r/NX3DE3/1, but the LT. part is absorbed into the preceding one:
([A-Za-z ]+)((20\d\d)|(19\d\d))([A-Za-z ]*)((LT))\.?[0-9,]*
The ((LT)) group is optional, but if I add a ? it works for the first example and not for the second, and vice versa.
I would also like to trim the different parts, but I really don't know how!
You can use
^(.*?)\s*((?:20|19)\d\d)\s*(.*?)(?:\s+(LT)[. ])?(\d[\d,]*)
See the regex demo. Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\s* - zero or more whitespaces
((?:20|19)\d\d) - Group 2: 20 or 19 and then two digits
\s* - zero or more whitespaces
(.*?) - Group 3: any zero or more chars other than line break chars as few as possible
(?:\s+(LT)[. ])? - an optional non-capturing group matching one or more whitespaces and then capturing into Group 4 LT and then a space or .
(\d[\d,]*) - Group 5: a digit and then zero or more digits or commas.
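
A short Python sketch of this pattern applied to the two sample strings follows. Python is an assumption here (the question does not name a language), and the variable names are illustrative. The optional Group 4 comes back as None when LT is absent, and the \s* parts keep the surrounding whitespace out of the captured groups, which covers the trimming requirement.

```python
import re

# The pattern from the answer, unchanged.
pattern = re.compile(r"^(.*?)\s*((?:20|19)\d\d)\s*(.*?)(?:\s+(LT)[. ])?(\d[\d,]*)")

samples = [
    "MAREMMA TOSCANA BIANCO DOC 2020 CALASOLE MONTEMASSI0,750",
    "CHIANTI CLASSICO DOCG 2012 RISERVA ALBOLA LT.0,750",
]

# Each row holds the five groups; Group 4 is None when "LT" is missing.
rows = [pattern.match(s).groups() for s in samples]

for row in rows:
    print(row)
```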

Regex: match patterns starting from the end of string

I wish to match a filename with column and line info, eg.
\path1\path2\a_file.ts:17:9
//what i want to achieve:
match[1]: a_file.ts
match[2]: 17
match[3]: 9
This string can have garbage before and after the pattern, like
(at somewhere: \path1\path2\a_file.ts:17:9 something)
What I have now is this regex, which manages to match column and line, but I got stuck on filename capturing part.. I guess negative lookahead is the way to go, but it seems to match all previous groups and garbage text in the end of string.
(?!.*[\/\\]):(\d+):(\d+)\D*$
Here's a link to current implementation regex101
You can replace the lookahead with a negated character class:
([^\/\\]+):(\d+):(\d+)\D*$
See the regex demo. Details:
([^\/\\]+) - Group 1: one or more chars other than / and \
: - a colon
(\d+) - Group 2: one or more digits
: - a colon
(\d+) - Group 3: one or more digits
\D*$ - zero or more non-digit chars till end of string.
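
The negated character class can be tried out with a small Python sketch like the one below. Python and the helper name parse_location are assumptions made for illustration; the question itself does not say which regex flavor is in use.

```python
import re

# The pattern from the answer: file name, line and column at the end of the string.
pattern = re.compile(r"([^\/\\]+):(\d+):(\d+)\D*$")

def parse_location(s):
    """Return (filename, line, column) or None if the string has no location."""
    m = pattern.search(s)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))

plain = parse_location(r"\path1\path2\a_file.ts:17:9")
noisy = parse_location(r"(at somewhere: \path1\path2\a_file.ts:17:9 something)")
print(plain, noisy)
```

Because Group 1 cannot contain / or \, it can only start after the last path separator, which is why the leading garbage and directories are left out.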

Regex to capture page number from filename

I have document page images named (for example) as follows:
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 02 [Declaration 2].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire].png”
“2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png”
I want to capture ONLY the page numbers, without preceding zeros (1, 2, 7, 12 in this example). Based on code I saw here, I thought maybe something like this might take care of it:
- 0*\d+.*\.(?:jpe?g|png|tiff?)$(?!(?:0*)\d+)
…but, it did not. Any other suggestions?
You could use a capturing group for the digits:
- 0*(\d+) \[[^][]*]\.(?:jpe?g|png|tiff?)\b
Explanation
- 0* Match - a space and 0+ times a zero
(\d+) Capture group 1, match 1+ digits
 \[[^][]*] Match a space and from [ till ]
\.(?:jpe?g|png|tiff?)\b Match a dot and one of the alternatives
Regex demo
To capture the last digits without leading zeroes after the last occurrence of space dash space, you could use a negative lookahead:
- 0*(\d+)(?!.* - ).*\.(?:jpe?g|png|tiff?)$
Regex demo
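
A possible way to check the capturing-group pattern against the sample file names, sketched in Python (an assumption, since the question does not name a language). The brackets inside the character class are escaped here for readability only; the variable names are illustrative.

```python
import re

# Capturing-group pattern from the answer, with the brackets in the class escaped.
pattern = re.compile(r"- 0*(\d+) \[[^\]\[]*\]\.(?:jpe?g|png|tiff?)\b")

names = [
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 02 [Declaration 2].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 07 [Fire].png",
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png",
]

# Group 1 holds the page number without leading zeros.
pages = [pattern.search(name).group(1) for name in names]
print(pages)
```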
So it looks like you want to end up at the last hyphen. Try:
-\h*(?!.*-)0*(\d+)
See the demo
-\h* - Match a literal hyphen and zero or more horizontal whitespaces.
(?!.*-) - A negative lookahead for zero or more characters and a hyphen.
0* - Zero or more zeroes.
(\d+) - Capture at least a single digit into capture group 1.
End note: Please give credit where credit is due. Your question did not have the necessary details given later through comments. This answer is far more detailed based on what you provided in the OP.
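
For readers trying the last-hyphen approach in Python: the re module does not support \h, so the sketch below substitutes [ \t] for it. That substitution and the helper name page_number are assumptions made for illustration; the rest of the pattern is as given above.

```python
import re

# Last-hyphen pattern, with \h replaced by [ \t] because Python's re lacks \h.
pattern = re.compile(r"-[ \t]*(?!.*-)0*(\d+)")

def page_number(name):
    """Return the page number after the last hyphen as an int, or None."""
    m = pattern.search(name)
    return int(m.group(1)) if m else None

first = page_number(
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 01 [Declaration 1].png"
)
last = page_number(
    "2020-07-24 07;17;09 - ABCD - 12345-67890 (14 Main St) - 12 [Fungi etc].png"
)
print(first, last)
```

Note that this variant relies on there being no hyphen after the page number, for example inside the bracketed description.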

regex to match pattern followed some string

I have following text. I want to capture the pattern ddd-dd-ddd followed by all text until I again hit a ddd-dd-ddd.
I am trying to use this regex
\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b.*
It matches 982-99-122 followed by the sentence until it hits a line feed. Then the second number, 586-33-453, is matched followed by the text on the same line, but it fails to capture the text that continues on the next line.
If I remove the line feeds from this string, it only matches the first number, 982-99-122, and then captures the rest of the string, i.e. it does not match the second number, 586-33-453, separately.
How should I fix both these issues: 1. when line feeds are part of the string and 2. when the string does not have line feeds?
982-99-122 (FCC 333/22) lube oil service pump 1b discharge lube oil service pump
aaa bb dsdsd
586-33-453 Matches exactly 3 times 0-e single character in the range between
dfldfldflkdf 545-66-666 sdkjsl () jdfkjd-kfdkf sdfl
848-99-040 sdsd"" df
dfdf
It seems you want
\b([0-9]{3}-[0-9]{2}-[0-9]{3})\b([\s\S]*?)(?=\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b|$)
See the regex demo
Details
\b - word boundary
([0-9]{3}-[0-9]{2}-[0-9]{3}) - Group 1: 3 digits, -, 2 digits, - and 3 digits
\b - word boundary
([\s\S]*?) - Group 2: any 0+ chars, as few as possible
(?=\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b|$) - a positive lookahead that requires 3 digits, -, 2 digits, - and 3 digits as a whole word or end of string immediately to the right of the current location.
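
A minimal Python sketch of the lookahead approach on the sample text from the question. Python is an assumption, as are the variable names. The lookahead is written without any quantifier after it: if it were optional, the lazy Group 2 could stop immediately and capture nothing.

```python
import re

# Sample text from the question, including the line feeds.
text = (
    "982-99-122 (FCC 333/22) lube oil service pump 1b discharge lube oil service pump\n"
    "aaa bb dsdsd\n"
    "586-33-453 Matches exactly 3 times 0-e single character in the range between\n"
    "dfldfldflkdf 545-66-666 sdkjsl () jdfkjd-kfdkf sdfl\n"
    '848-99-040 sdsd"" df\n'
    "dfdf"
)

# Number in Group 1, then everything up to the next number or the end of the string.
pattern = re.compile(
    r"\b([0-9]{3}-[0-9]{2}-[0-9]{3})\b([\s\S]*?)"
    r"(?=\b[0-9]{3}-[0-9]{2}-[0-9]{3}\b|$)"
)

# Pair each number with its following text, stripped of surrounding whitespace.
records = [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]

for number, body in records:
    print(number, repr(body))
```

Since [\s\S] matches line feeds too, the same pattern handles both cases from the question: text with line breaks and text on a single line.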

Matching numbers with an optional delimiter in between

So I have this regex: /[0-3][0-9][0-1][1-9]\d{2}[-\s]\d{4}?[^0-9]*|[0-3][0-9][0-1][1-9]\d{2}\d{4}/
This regex code take this kind of numbers:
1002821187
100282 1187
100282-1187
But I found out I don't want numbers like 1002821187.
So is it possible to make one regex that only finds:
100282 1187
100282-1187
Your regex contains an alternation that matches the numbers with and without the space or -. You need to require that space or hyphen:
^[0-3][0-9][0-1][1-9][0-9]{2}[-\s][0-9]{4}$
                             ^^^^^
See the regex demo. If you do not need to check for any boundaries, remove ^ and $ anchors that make the pattern match the whole string and use [0-3][0-9][0-1][1-9][0-9]{2}[-\s][0-9]{4}. Or use word boundaries to find whole words, \b[0-3][0-9][0-1][1-9][0-9]{2}[-\s][0-9]{4}\b.
Details
^ - start of string
[0-3] - a digit from 0 to 3
[0-9] - any digit
[0-1] (=[01]) - 0 or 1
[1-9] - any digit other than 0
[0-9]{2} - 2 digits
[-\s] - a - or whitespace
[0-9]{4} - 4 digits
$ - end of string.
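
The anchored and word-boundary variants can be compared with a small Python sketch. Python is an assumption (the question's /.../ delimiters suggest another language), and the sample sentence is made up for illustration.

```python
import re

# Anchored pattern from the answer: the whole string must be one number.
anchored = re.compile(r"^[0-3][0-9][0-1][1-9][0-9]{2}[-\s][0-9]{4}$")

# Word-boundary variant from the answer: finds numbers inside longer text.
embedded = re.compile(r"\b[0-3][0-9][0-1][1-9][0-9]{2}[-\s][0-9]{4}\b")

candidates = ["1002821187", "100282 1187", "100282-1187"]
accepted = [c for c in candidates if anchored.match(c)]

# Hypothetical sentence mixing all three forms.
found = embedded.findall("ids: 1002821187, 100282 1187 and 100282-1187")

print(accepted)
print(found)
```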
Though you did not specify exactly what you are trying to do, you can try the following regex:
\b\d{6}-\d{4}\b|\b\d{4}\b|\b\d{6}\b
Hope it helps.