How to match multi-line text using the right regex capture group? - regex

I'm trying to read in a CSV and split each row using regex capture groups. The last column of the CSV has newline characters in it and my regex's second capture group seems to be breaking at the first occurrence of that newline character and not capturing the rest of the string.
Below is what I've managed to do so far. Each record starts with ABC-, so I put that in my first capturing group, and everything after it, up to the next occurrence of ABC- or the end of the file (for the last record), should be captured by the second capturing group. The first row works as expected because there are no newline characters in it, but the rest don't.
My regex: ([A-Z1-9]+)-\d*,(.*)
My test string:
ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM
MULTI LINE
TEXT 2",
ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS
ANOTHER RANDOM
MULTI LINE TEXT",
Expected result is:
3 matches
Match 1:
Group 1: ABC-1,
Group 2: 01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
Match 2:
Group 1: ABC-2,
Group 2: 01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM
MULTI LINE
TEXT 2",
Match 3:
Group 1: ABC-3,
Group 2: 01/01/1974,X3,Y3,Z3,"THIS IS
ANOTHER RANDOM
MULTI LINE TEXT",

You can use
^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)
See the regex demo. Use it with the multiline flag so that ^ matches at the start of every line (in Ruby the flag is not needed, as ^ already matches line start positions there).
Details:
^ - start of a line
([A-Z]+-\d+) - Group 1: one or more uppercase ASCII letters and then - and one or more digits
, - a comma
(.*(?:\n(?![A-Z]+-\d+,).*)*) - Group 2:
.* - the rest of the line
(?:\n(?![A-Z]+-\d+,).*)* - zero or more lines that do not start with one or more uppercase ASCII letters and then - and one or more digits + a comma
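As a minimal sketch, the pattern above can be tried in Python against the test string from the question. Python is an assumption here, since the question does not name a language, and the names RECORD and records are illustrative only.

```python
import re

# Pattern from the answer; re.MULTILINE makes ^ match at every line start.
RECORD = re.compile(
    r"^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)",
    re.MULTILINE,
)

# Test string from the question (typo in the data kept as posted).
text = (
    'ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",\n'
    'ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS\n'
    'A RANDOM\n'
    'MULTI LINE\n'
    'TEXT 2",\n'
    'ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS\n'
    'ANOTHER RANDOM\n'
    'MULTI LINE TEXT",'
)

# One (id, rest-of-record) pair per record; the rest may span several lines.
records = [(m.group(1), m.group(2)) for m in RECORD.finditer(text)]

for record_id, rest in records:
    print(record_id, repr(rest))
```

Note that with this pattern Group 1 holds the ID without the trailing comma (ABC-1, not ABC-1,), because the comma sits between the two groups.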

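You can try to limit the second group with a lookahead assertion. This needs both the multiline flag, so that ^ matches at line starts, and the dotall (single-line) flag, so that . also matches newlines: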
(ABC-\d+,)(.*?(?=^ABC|\z))
Demo here.
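A hedged Python sketch of this lookahead variant follows. Python is an assumption, and because Python's re module (before 3.14) spells the end-of-string anchor \Z rather than \z, the pattern is adapted accordingly. The names LOOKAHEAD and pairs are illustrative.

```python
import re

# re.MULTILINE: ^ matches at line starts; re.DOTALL: . also matches newlines.
# \Z is Python's end-of-string anchor (the answer's \z in other flavours).
LOOKAHEAD = re.compile(r"(ABC-\d+,)(.*?(?=^ABC|\Z))", re.MULTILINE | re.DOTALL)

text = (
    'ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",\n'
    'ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS\n'
    'A RANDOM\n'
    'MULTI LINE\n'
    'TEXT 2",\n'
    'ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS\n'
    'ANOTHER RANDOM\n'
    'MULTI LINE TEXT",'
)

# Group 2 runs lazily up to the next line that starts with ABC, or to the end.
pairs = [(m.group(1), m.group(2)) for m in LOOKAHEAD.finditer(text)]

for record_id, rest in pairs:
    print(record_id, repr(rest))
```

Here Group 1 includes the trailing comma, and Group 2 of every record except the last ends with the newline that precedes the next ABC line, so it may need stripping.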

Related

Regex repeated lines

Suppose I have this String:
Speaker 1:
Lorem ipsum
Speaker 1:
This is text
Speaker 1:
Another one
Speaker 2:
Yadda Yadda
Speaker 1:
Text
Speaker 2:
New text
I want to remove the second and third occurrence of Speaker 1: but keep the first and fourth one via regex.
I tried using (Speaker 1:)(.|\n)*((Speaker 1:))(.|\n)*(Speaker 2:) to be able to access the groups but this didn't work out.
How can I access only the repeated lines containing Speaker 1: which are followed by Speaker 2:?
You might use a capture group to keep the first occurrence.
Then match all consecutive parts that start with the same Speaker, digits and : using a backreference.
In the replacement use group 1 to keep the first occurrence.
^((Speaker \d+:)(?:\n(?!Speaker ).*)*)(?:\n\2(?:\n(?!Speaker ).*)*)*
^ Start of string
( Capture group 1
(Speaker \d+:) Capture group 2 Match Speaker and 1+ digits
(?:\n(?!Speaker ).*)* Match all lines that do not start with Speaker
) Close group 1
(?: Non capture group
\n\2 Match a newline and a backreference to group 2
(?:\n(?!Speaker ).*)* Match a newline and all lines that do not start with Speaker
)* Close the non capture group and optionally repeat it
Regex demo
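To see what the pattern does, here is a small Python sketch (Python and the names SPEAKER_RUN and result are assumptions for illustration) that applies it to the sample text with group 1 as the replacement.

```python
import re

# Pattern from the answer; re.MULTILINE lets ^ match at every line start.
SPEAKER_RUN = re.compile(
    r"^((Speaker \d+:)(?:\n(?!Speaker ).*)*)(?:\n\2(?:\n(?!Speaker ).*)*)*",
    re.MULTILINE,
)

text = "\n".join([
    "Speaker 1:", "Lorem ipsum",
    "Speaker 1:", "This is text",
    "Speaker 1:", "Another one",
    "Speaker 2:", "Yadda Yadda",
    "Speaker 1:", "Text",
    "Speaker 2:", "New text",
])

# Keep only group 1 (the first block of each run of the same speaker).
result = SPEAKER_RUN.sub(r"\1", text)
print(result)
```

In this sketch the lines under the repeated headers ("This is text", "Another one") are part of the overall match but not of group 1, so they are dropped together with the repeated Speaker 1: lines. If those lines should be kept, they would need a capture of their own.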

Regex for replacing everything after a keyword with colon up to any other keyword with colon

I have the following type of strings:
This is a test: 1, two again,three test2: what is, this
test: acid, kool-aid word: some more info
Another test: face, 3, & yes
What I'd like to do is remove test: and everything after until it hits another word that has a colon.
The result set from above would look like:
This is a test2: what is, this
word: some more info
Another
Here's what I've attempted, but this fails when there is NO word with a colon (so example 3 fails)
test:.+?(?=\w+:)
You can use this regex for matching (note that it begins with a literal space before the *):
 *\btest:.*?\b(?=\w+:|$)
And replace with empty string.
RegEx Demo
RegEx Details:
A space followed by *: Match 0 or more spaces
\btest: Match full word test:
.*?\b: Match 0 or more of any characters (lazy match) followed by a word boundary
(?=\w+:|$): Positive lookahead to assert that we have a word + : or end of line ahead.
With your shown samples, please try the following regex. It matches each line in three parts: the first runs from the start of the line to just before the first word followed by a colon (capture group 1); the second runs from that word with its colon to the next word followed by a colon, or to the end of the line (no capture group is created for this part); the third is the rest of the value (capture group 2). If a line has nothing after the second part, only capture group 1 holds text; otherwise both groups do. Perform the substitution accordingly.
^(.*?)\s*\w+:.*?(?:\w+:|$)\s*(.*)$
Online demo for above regex
You were on the right track. For the last case, where there is no second word with a colon, you also need to allow a match at the end of the line with the $ anchor. So you can use:
test:.*?(?=$|\b\w+:)
Demo
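As a rough Python sketch (Python is an assumption; TEST_SPAN and cleaned are illustrative names), the lookahead with the $ alternative can be checked against the three sample strings. The rstrip() call is an addition of this sketch, not part of the answers: it trims the space left behind when test: is the last keyword on the line.

```python
import re

# Lazy match from "test:" up to the next word followed by a colon,
# or up to the end of the line when no such word exists.
TEST_SPAN = re.compile(r"test:.*?(?=$|\b\w+:)")

lines = [
    "This is a test: 1, two again,three test2: what is, this",
    "test: acid, kool-aid word: some more info",
    "Another test: face, 3, & yes",
]

# rstrip() is this sketch's own addition to drop a trailing space.
cleaned = [TEST_SPAN.sub("", line).rstrip() for line in lines]

for line in cleaned:
    print(repr(line))
```

Each string is processed on its own here, so no multiline flag is needed; for a single multi-line string, $ would need the multiline flag to match at every line end.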

How to extract text before several specified alphanumeric whole words from string in plsql

How can I remove all characters from a specific alphanumeric value onward in a string?
For example, "covid19 1st case" should become "covid19" when everything from 1st onward is removed;
likewise, "covid19 2d case" should become "covid19" when everything from 2d onward is removed.
I am trying the query below:
select regexp_substr('covid19 1st case','[^1st]*') from dual;
but it gives covid as output. Any leads?
If we have predefined alphanumeric values, can we do it in a single expression,
for example, remove everything after either 1st or 2d?
Thanks
You can use
select regexp_substr('covid19 1st case','^(.*?)\s+(1st|2d)($|\W)', 1, 1, NULL, 1) from dual;
select regexp_substr('covid19 1st case','^(.*?)\s*(^|\W)(1st|2d)($|\W)', 1, 1, NULL, 1) from dual;
See the regex demo #1 and regex demo #2.
The (^|\W) and ($|\W) are used instead of word boundaries, which are not supported by the Oracle SQL regex engine.
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\s* - zero or more whitespaces (\s+ matches one or more)
(^|\W) - Group 2: start of string or a non-word char
(1st|2d) - Group 3: either 1st or 2d
($|\W) - Group 4: end of string or a non-word char.
Another variation is using REGEXP_REPLACE (you just need to match the rest of the string):
select regexp_replace('covid19 1st case','^(.*?)\s*(\W|^)(1st|2d)(\W|$).*', '\1') from dual;
See this regex demo, \1 refers to the Group 1 value.
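The same pattern can be explored outside the database. The sketch below uses Python's re module, which is an assumption and not the Oracle engine, so it only illustrates how Group 1 is picked out. The helper text_before_marker is a hypothetical name, and returning the input unchanged when nothing matches is a choice of this sketch (REGEXP_SUBSTR would return NULL).

```python
import re

# Same pattern as the second REGEXP_SUBSTR call; Group 1 is the text to keep.
STOP_WORD = re.compile(r"^(.*?)\s*(^|\W)(1st|2d)($|\W)")


def text_before_marker(value: str) -> str:
    """Return the text before a whole-word 1st or 2d (hypothetical helper)."""
    match = STOP_WORD.search(value)
    # Sketch choice: leave the value untouched when no marker is found.
    return match.group(1) if match else value


print(text_before_marker("covid19 1st case"))
print(text_before_marker("covid19 2d case"))
```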

google sheets regextract nth occurence on new line

Example Data (all in one google sheet cell)
Test# 123
Bob# abc
how are you doing
John# test
... # ...
My goal is to return everything after the third #, so in this example " test". I have fiddled with a lot of examples online, but they seem to be incompatible with the Google Sheets version of REGEXEXTRACT for some reason.
I can use (\n.*){4} to return the fourth line but that is no good because as you can see from the example data I do not know how many lines of data will be between a # and also that is not extracting from the # let alone the third #.
Goal: extract the third # to the end of the line including or excluding the # will do.
Here is an idea I had, although the syntax is surely butchered: ((\n)(?=#.*)){3}. I would expect this regex to grab the third # if it were at the beginning of a line, but I can't even get that working, let alone when the # occurs in the middle of the line, as in my example.
You may use
=REGEXEXTRACT(A1, "^(?:[^#]*#){3}(.+)")
To get all text from the 3rd till the 4th # char:
=REGEXEXTRACT(A1, "^(?:[^#]*#){3}([^#]+)")
Details
^ - start of string
(?:[^#]*#){3} - three occurrences of 0+ chars other than # and then # (the (?:...) is a non-capturing group, you need it to group a sequence of patterns without the need to return the text this group pattern matches)
(.+) - Capturing group 1 (REGEXEXTRACT returns the text captured into the group only if the group is specified): any 1+ chars other than line break chars
([^#]+) - this captures into Group 1 any one or more chars other than #.
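For a quick check outside Sheets, both patterns can be run with Python's re module. This is a sketch under the assumption that the two engines behave the same for these simple constructs; the variable names are illustrative.

```python
import re

# The cell content from the question, with line breaks as \n.
cell = "Test# 123\nBob# abc\nhow are you doing\nJohn# test\n... # ..."

# Everything after the third # up to the end of that line.
after_third = re.search(r"^(?:[^#]*#){3}(.+)", cell).group(1)

# Everything between the third and the fourth # (may span line breaks).
until_fourth = re.search(r"^(?:[^#]*#){3}([^#]+)", cell).group(1)

print(repr(after_third))
print(repr(until_fourth))
```

The first pattern stops at the line break because . does not match it, while [^#]+ runs across line breaks until the next #.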
also works:
=REGEXEXTRACT(INDEX(SPLIT(A1, "#"),,4), " (.+)")

I need to combine multiple lines starting with the same ID

I have multiple lines in a text file that I need to combine together. The file is about 200 million lines long, so opening it with Excel and using their built-in tools is out of the picture.
The first set of lines looks like this:
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
The second set, whose values I want to append to the end of the corresponding lines of the first set, is:
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
If anyone has experience with this, I'd love some help
Code
Regex
^(\d+),(.*$)(?=[\s\S]*^\1,(.*))
Formatting output
$1,$2,$3
Results
Input
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Output
1,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Explanation
^ Assert position at the start of a line
(\d+) Capture one or more digits into capture group 1
, Match the comma character , literally
(.*$) Capture any number of any character (except newline characters) until the asserted position at the end of the line (asserting end of line position dramatically reduces steps) into capture group 2
(?=[\s\S]*^\1,(.*)) Positive lookahead asserting what follows matches
[\s\S]* Match any number of any character (\s: any whitespace character; \S: any non-whitespace character)
^ Assert position at the start of a line
\1 Matches the same text as most recently matched by the 1st capturing group
, Matches the comma character , literally
(.*) Capture any number of any character (except newline characters) into capture group 3
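A minimal Python sketch of this approach follows. Python is an assumption, the names MERGE and merged are illustrative, and the hashes are replaced by short placeholder values to keep the example readable.

```python
import re

# Pattern from the answer; re.MULTILINE makes ^ and $ work per line.
MERGE = re.compile(r"^(\d+),(.*$)(?=[\s\S]*^\1,(.*))", re.MULTILINE)

# Shortened stand-in data: placeholder values instead of the real hashes.
text = "\n".join([
    "1,example#gmail.com,Username",
    "3,example#gmail.com,Username",
    "10,example#gmail.com,Username",
    "1,HASH_FOR_1",
    "3,HASH_FOR_3",
    "10,HASH_FOR_10",
])

# Build "id,first-set-fields,second-set-fields" for every matching line.
merged = [m.expand(r"\1,\2,\3") for m in MERGE.finditer(text)]

for line in merged:
    print(line)
```

Lines of the second set do not match on their own, because no later line starts with the same ID, so only the merged lines come out of finditer. The lookahead rescans the rest of the input for every line, so this is likely to be slow on a file of 200 million lines.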