Regex repeated lines - regex

Suppose I have this String:
Speaker 1:
Lorem ipsum
Speaker 1:
This is text
Speaker 1:
Another one
Speaker 2:
Yadda Yadda
Speaker 1:
Text
Speaker 2:
New text
I want to remove the second and third occurrence of Speaker 1: but keep the first and fourth one via regex.
I tried using (Speaker 1:)(.|\n)*((Speaker 1:))(.|\n)*(Speaker 2:) to be able to access the groups but this didn't work out.
How can I access only the repeated lines containing Speaker 1: which are followed by Speaker 2:?

You might use a capture group to keep the first occurrence.
Then match all consecutive blocks that start with the same Speaker, digits and : using a backreference.
In the replacement use group 1 to keep the first occurrence.
^((Speaker \d+:)(?:\n(?!Speaker ).*)*)(?:\n\2(?:\n(?!Speaker ).*)*)*
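^ Start of string (or start of each line with the multiline flag, which is needed to handle every run of speakers in the text)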
( Capture group 1
(Speaker \d+:) Capture group 2 Match Speaker and 1+ digits
(?:\n(?!Speaker ).*)* Match all lines that do not start with Speaker
) Close group 1
(?: Non capture group
\n\2 Match a newline and a backreference to group 2
(?:\n(?!Speaker ).*)* Match a newline and all lines that do not start with Speaker
)* Close the non capture group and optionally repeat it
Regex demo
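A minimal Python sketch of the replacement described above, assuming the multiline flag so that ^ matches at every line start. Note that replacing with group 1 as written drops the whole repeated blocks, including their text lines. The keep_text callback is a hypothetical helper of mine, not part of the answer: it removes only the repeated speaker labels and keeps the text.

```python
import re

text = "\n".join([
    "Speaker 1:", "Lorem ipsum",
    "Speaker 1:", "This is text",
    "Speaker 1:", "Another one",
    "Speaker 2:", "Yadda Yadda",
    "Speaker 1:", "Text",
    "Speaker 2:", "New text",
])

# Pattern from the answer; re.MULTILINE lets ^ match at each line start.
pattern = re.compile(
    r"^((Speaker \d+:)(?:\n(?!Speaker ).*)*)(?:\n\2(?:\n(?!Speaker ).*)*)*",
    re.MULTILINE,
)

# Replacing with group 1 keeps only the first block of each run of a speaker.
first_block_only = pattern.sub(r"\1", text)


def keep_text(match):
    """Hypothetical helper: drop repeated labels but keep their text lines."""
    first_block = match.group(1)
    repeated = match.group(0)[len(first_block):]
    # Text lines never start with "Speaker ", so only label lines are removed.
    return first_block + repeated.replace("\n" + match.group(2), "")


labels_removed = pattern.sub(keep_text, text)
print(labels_removed)
```

Which of the two results is wanted depends on whether the text under the repeated labels should survive; the question does not say so explicitly.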

Related

How to match multi-line text using the right regex capture group?

I'm trying to read in a CSV and split each row using regex capture groups. The last column of the CSV has newline characters in it and my regex's second capture group seems to be breaking at the first occurrence of that newline character and not capturing the rest of the string.
Below is what I've managed to do so far. The first record always starts with ABC-, so I put that in my first capturing group and everything else after it, till the next occurrence of ABC- or end of file (if last record), should be captured by the second capturing group. The first row works as expected because there's no newline characters in it, but the rest won't.
My regex: ([A-Z1-9]+)-\d*,(.*)
My test string:
ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM
MULTI LINE
TEXT 2",
ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS
ANOTHER RANDOM
MULTI LINE TEXT",
Expected result is:
3 matches
Match 1:
Group 1: ABC-1,
Group 2: 01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
Match 2:
Group 1: ABC-2,
Group 2: 01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM
MULTI LINE
TEXT 2",
Match 3:
Group 1: ABC-3,
Group 2: 01/01/1974,X3,Y3,Z3,"THIS IS
ANOTHER RANDOM
MULTI LINE TEXT",
You can use
^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)
See the regex demo. Only use it with the multiline flag (if it is not Ruby, as ^ already matches line start positions in Ruby).
Details:
^ - start of a line
([A-Z]+-\d+) - Group 1: one or more uppercase ASCII letters and then - and one or more digits
, - a comma
(.*(?:\n(?![A-Z]+-\d+,).*)*) - Group 2:
.* - the rest of the line
(?:\n(?![A-Z]+-\d+,).*)* - zero or more lines that do not start with one or more uppercase ASCII letters and then - and one or more digits + a comma
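As a rough sketch, the pattern above can be tried in Python like this. The variable names are my own, and the test string is the one from the question. Note that group 1 holds the key without the trailing comma, which differs slightly from the expected result in the question.

```python
import re

csv_text = "\n".join([
    'ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",',
    'ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS',
    'A RANDOM',
    'MULTI LINE',
    'TEXT 2",',
    'ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS',
    'ANOTHER RANDOM',
    'MULTI LINE TEXT",',
])

# re.MULTILINE makes ^ match at the start of every line, not only the string.
record = re.compile(r"^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)", re.MULTILINE)

# One (key, rest) pair per record; rest may span several lines.
records = [(m.group(1), m.group(2)) for m in record.finditer(csv_text)]
for key, rest in records:
    print(key, repr(rest))
```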
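You can try to limit the second group with a lookahead assertion (this needs the dotall/single-line flag so that . matches newlines, and the multiline flag so that ^ matches at line starts):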
(ABC-\d+,)(.*?(?=^ABC|\z))
Demo here.

Regular expression with multiline matching (subtitles strings)

Need some help in regexp matching pattern.
The text goes like here (it's subtitles for video)
...
223
00:20:47,920 --> 00:20:57,520
- Hello! This is good subtitle text.
- Yes! How are you, stackoverflow?
224
00:20:57,520 --> 00:21:11,120
Wow, seems amazing.
- We're good, thanks.
Like, you know, everyone is happy around here with their laptops.
225
00:21:11,120 --> 00:21:14,440
- Understood. Some dumb text
...
I need a set of groups:
startTime, endTime, text
For now my achievements are not very good. I can get startTime, endTime and some text, but not all the text, only the last sentence. I've attached a screenshot.
As you can see, group 3 is capturing text, but only last sentence.
Please explain what I'm doing wrong.
Thank you.
Accounting for the possibility that there is no newline character after the final text of your string, would the following work for you:
(\d\d:\d\d:\d\d,\d\d\d)[ >-]*?((?1))\n(.*?(?=\n\n|\Z))
See the online demo
(\d\d:\d\d:\d\d,\d\d\d) - The same pattern as you used to capture starting time in 1st capture group.
[ >-]*? - 0+ (but lazy) character from the character class up to:
((?1)) - A 2nd capture group which matches the same pattern as 1st group.
\n - A newline-character.
(.*?(?=\n\n|\Z)) - A 3rd capture group that captures anything (including newline with the s-flag) up to a positive lookahead for either two newline characters or the end of the whole string.
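Note that some (not all) engines support subroutine calls such as (?1), which reuse the pattern of a previous group (unlike a backreference, which matches the same text again). I guess the app you are using does not. Therefore, you can replace (?1) with a copy of the first group's pattern to capture the 2nd group.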
Another option is to use a pattern that would capture all lines in group 3 that do not start with 3 digits.
(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d\d\d\b).*)*)
Explanation
(\d\d:\d\d:\d\d,\d\d\d) Capture group 1 Match a time like pattern
--> Match literally
(\d\d:\d\d:\d\d,\d\d\d) Capture group 2 Same pattern as group 1
( Capture group 3
(?: Non capture group
\r?\n(?!\d\d\d\b).* Match a newline and assert using a negative lookahead that the line does not start with 3 digits followed by word boundary. If that is the case, match the whole line
)* Optionally repeat all lines
) Close group 3
Regex demo
A bit more specific pattern could be matching all lines that do not start with 3 digits or a start/end time like pattern.
^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)
Regex demo
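A small Python sketch of the more specific pattern, assuming the subtitle block from the question without blank lines between cues (stray blank lines would be removed by the strip() call anyway). The names cue and cues are my own.

```python
import re

subtitles = "\n".join([
    "223",
    "00:20:47,920 --> 00:20:57,520",
    "- Hello! This is good subtitle text.",
    "- Yes! How are you, stackoverflow?",
    "224",
    "00:20:57,520 --> 00:21:11,120",
    "Wow, seems amazing.",
    "- We're good, thanks.",
    "Like, you know, everyone is happy around here with their laptops.",
    "225",
    "00:21:11,120 --> 00:21:14,440",
    "- Understood. Some dumb text",
])

# re.MULTILINE is needed for both ^ and the $ inside the lookahead.
cue = re.compile(
    r"^(\d\d:\d\d:\d\d,\d\d\d)[^\S\r\n]+-->[^\S\r\n]+(\d\d:\d\d:\d\d,\d\d\d)"
    r"((?:\r?\n(?!\d+$|\d\d:\d\d:\d\d,\d\d\d\b).*)*)",
    re.MULTILINE,
)

# Group 3 starts with a newline, so strip it before splitting into lines.
cues = [
    (m.group(1), m.group(2), m.group(3).strip().splitlines())
    for m in cue.finditer(subtitles)
]
for start, end, lines in cues:
    print(start, end, lines)
```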

How to include a substring EXCEPT an exact one in middle of REGEX expression?

Issue
I'm trying to match 3 groups, where one is conditional
String: 12345-12345-1230
Group 1: 12345-12345
Group 2: -123
Group 3: 0
However I only want to match Group 2 if the string is NOT "-000". Meaning group 2 will either be blank if that section is '-000' or it will be whatever else those 4 characters are ('-123', '-001', etc.).
Here is the REGEX with it just accepting anything as group 2:
^(.{5}-.{5})(.{4})([0-9])$ regex101
What I've tried
Negative Lookahead:
^(.{5}-.{5})(?!-000)([0-9])$
^(.{5}-.{5})(.{4}(?!.{4}))([0-9])$
OR Operator:
^(.{5}-.{5})(-000)|(.{4})([0-9])$
This is the closest I've come, however I can't get it to work WITH the final condition ([0-9])$. It's also not ideal to have the remove case (-000) as a separate group as the accept case (not -000).
You may try:
^(\d{5}-\d{5})(?:-000|(-\d{3}))(\d)$
See the online demo.
^ - Start of line anchor.
( - Open 1st capture group.
\d{5}-\d{5} - Match 5 digits, a hyphen, and again 5 digits.
) - Close 1st capture group.
(?: - Open non-capturing group.
-000 - Match "-000" literally.
| - Pipe symbol used as an or-operator.
( - Open 2nd capture group.
-\d{3} - Match a hyphen and 3 digits.
) - Close 2nd capture group.
) - Close non-capturing group.
( - Open 3rd capture group.
\d - Match a single digit.
) - Close 3rd capture group.
$ - End of line anchor.
If you want to capture the 2nd group without the hyphen, then try: ^(\d{5}-\d{5})-(?:000|(\d{3}))(\d)$
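To see how the alternation leaves group 2 empty for "-000", here is a short Python sketch. The second input string is a made-up example, not from the question. In Python a group that did not take part in the match is reported as None, while other engines may give an empty string instead.

```python
import re

pattern = re.compile(r"^(\d{5}-\d{5})(?:-000|(-\d{3}))(\d)$")

# The string from the question: the middle part is captured in group 2.
with_middle = pattern.match("12345-12345-1230").groups()

# Hypothetical input with "-000": the first alternative matches, so
# group 2 does not participate and comes back as None.
without_middle = pattern.match("12345-12345-0007").groups()

print(with_middle)
print(without_middle)
```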
Try this:
(\d{5}-\d{5})(?!-000)(-\d{3})(0)
See Demo

Regex for all characters upto not including \n

Here I have a text string.
Serial#......... 12345678910123456\nCust#........... 654321\nCustomer Name... Some Customer\nBILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd\nDATE...... 01/01/00
I want to capture 2 parts of this string.
Cust#........... 654321 BILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd
using regex.
So far I have Cust#.*?\d+ which captures
Cust#........... 654321
However, I don't think this is the best approach.
Note: this is 1 string out of thousands, so the data within the strings is dynamic. Can I capture what lies between the end-of-line \n characters to achieve my result?
Try this regex: ^.*?\n(.*?)\n.*?\n(.*?)\n.*$ at least it should give you a different way of looking at the problem.
It describes the entire string, using newlines as element delimiters. The parentheses define the groups you want to save, which capture the 2nd and 4th lines.
Of course this depends on the elements you want always being the 2nd and 4th and being delimited by the newlines.
https://regex101.com/r/harmzn/1
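A quick Python sketch of this positional approach, assuming the \n sequences in the question are real newline characters and that the wanted parts are always the 2nd and 4th lines:

```python
import re

text = (
    "Serial#......... 12345678910123456\n"
    "Cust#........... 654321\n"
    "Customer Name... Some Customer\n"
    "BILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd\n"
    "DATE...... 01/01/00"
)

# No flags needed: . never crosses a newline, so each .*? is exactly one line.
m = re.match(r"^.*?\n(.*?)\n.*?\n(.*?)\n.*$", text)
second_line, fourth_line = m.groups()
print(second_line)
print(fourth_line)
```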
You might use 2 capturing groups. In the first group, use your pattern without the lazy quantifier, as the digits are at the end of the line.
Then match (not capture) all the lines that do not start with BILL
After that, capture in group 2 the whole line that starts with BILL
^(Cust#.*\d+)(?:\r?\n(?!BILL ).*)*\r?\n(BILL .*)
Explanation
^ Start of line (use the multiline flag, as Cust# is not on the first line of the string)
( Capture group 1
Cust#.*\d+ The pattern to match Cust# with the digits at the end
) Close group
(?:\r?\n(?!BILL ).*)*\r?\n Match all lines that do not start with BILL
( Capture group 2
BILL .* Match the line that starts with BILL
) Close group
Regex demo
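The same pattern in a minimal Python sketch, again assuming the \n in the question are real newlines. The multiline flag is required here because Cust# starts the second line, not the string.

```python
import re

text = (
    "Serial#......... 12345678910123456\n"
    "Cust#........... 654321\n"
    "Customer Name... Some Customer\n"
    "BILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd\n"
    "DATE...... 01/01/00"
)

pattern = re.compile(
    r"^(Cust#.*\d+)(?:\r?\n(?!BILL ).*)*\r?\n(BILL .*)",
    re.MULTILINE,
)

# search() scans forward until ^ finds the line that starts with Cust#.
m = pattern.search(text)
cust, bill = m.groups()
print(cust)
print(bill)
```

Unlike the positional approach, this does not depend on the two lines being exactly the 2nd and 4th ones, only on their prefixes.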

Regex - optional capture group after wildcard

Say I have the following list:
No 1 And Your Bird Can Sing (4)
No 2 Baby, You're a Rich Man (5)
No 3 Blue Jay Way S
No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)
And I want to extract the number, the title and the number of weeks in the parenthesis if it exists.
Works, but the last group is not optional (regstorm):
No (?<no>\d{1,3}) (?<title>.*?) \((?<weeks>\d)\)
Last group optional, only matches number (regstorm):
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?
Combining one pattern with week capture with a pattern without week capture works, but there gotta be a better way:
(No (?<no>\d{1,3}) (?<title>.*) \((?<weeks>\d)\))|(No (?<no>\d{1,3}) (?<title>.*))
I use C# and javascript but I guess this is a general regex question.
Your regex is almost there!
First and most importantly, you should add a $ at the end. This forces (?<title>.*?) to keep expanding until the rest of the pattern, including the end of the string, can match. Currently, (?<title>.*?) matches an empty string and then stops, because it realises that it has reached a point where the rest of the regex matches. Why does the rest of the regex match? Because the optional group can match an empty string. By adding the $, you are making the rest of the regex "harder" to match.
Secondly, you forgot to match an open parenthesis \(.
This is what your regex should look like:
No (?<no>\d{1,3}) (?<title>.*?)( \((?<weeks>\d)\))?$
Demo
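The effect of the $ can be checked with a small Python sketch. Python spells named groups (?P<name>...) rather than (?<name>...) as in C# or JavaScript, so the patterns below are translated accordingly. The variable names are my own.

```python
import re

lines = [
    "No 1 And Your Bird Can Sing (4)",
    "No 2 Baby, You're a Rich Man (5)",
    "No 3 Blue Jay Way S",
    "No 4 Everybody's Got Something to Hide Except Me and My Monkey (1)",
]

without_anchor = re.compile(
    r"No (?P<no>\d{1,3}) (?P<title>.*?)( \((?P<weeks>\d)\))?"
)
with_anchor = re.compile(
    r"No (?P<no>\d{1,3}) (?P<title>.*?)( \((?P<weeks>\d)\))?$"
)

# Without $, the lazy title stops at once and matches the empty string.
empty_titles = [without_anchor.match(line).group("title") for line in lines]

# With $, the title must grow until the optional part and the line end fit.
parsed = [with_anchor.match(line).group("no", "title", "weeks") for line in lines]
for entry in parsed:
    print(entry)
```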
You may use this regex with an optional last part:
^No (?<no>\d{1,3}) (?<title>.*?\S)(?: \((?<weeks>\d)\))?$
RegEx Demo
Another option could be for the title to match either not ( or when it does encounter a ( it should not be followed by a digit and a closing parenthesis.
^No (?<no>\d{1,3}) (?<title>(?:[^(\r\n]+|\((?!\d\)))+)(?:\((?<weeks>\d)\))?
In parts
^No
(?<no>\d{1,3}) Group no and space
(?<title>
(?: Non capturing group
[^(\r\n]+ Match any char except ( or newline
| Or
\((?!\d\)) Match ( if not directly followed by a digit and )
)+ Close group and repeat 1+ times
) Close group title
(?: Non capturing group
\((?<weeks>\d)\) Group weeks between parenthesis
)? Close group and make it optional
Regex demo
If you don't want to trim the last space of the title you could exclude it from matching before the weeks.
Regex demo
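For the alternation-based pattern (the one broken down "In parts" above), a rough Python sketch shows the trailing space that ends up in the title. Named groups are written (?P<name>...) for Python, and the input "No 5 Some Song (Live) (3)" is a made-up line to exercise a title that itself contains parentheses.

```python
import re

pattern = re.compile(
    r"^No (?P<no>\d{1,3}) "
    r"(?P<title>(?:[^(\r\n]+|\((?!\d\)))+)"
    r"(?:\((?P<weeks>\d)\))?"
)

samples = [
    "No 1 And Your Bird Can Sing (4)",
    "No 3 Blue Jay Way S",
    "No 5 Some Song (Live) (3)",  # hypothetical line, not from the question
]

# The raw title keeps the space before the weeks part.
raw_titles = [pattern.match(s).group("title") for s in samples]

# Trimming afterwards gives the clean title next to the optional weeks.
parsed = [
    (m.group("no"), m.group("title").strip(), m.group("weeks"))
    for m in map(pattern.match, samples)
]
print(raw_titles)
print(parsed)
```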