Regular expression to capture multiple lines - regex

I have some text like this:
Note: this is example text so the content is unimportant
CAT SAT ON A DOG
REASON: No reason
CONCERN: He was cold
BECAUSE: Cold weather
CAT SAT ON A MOUSE
REASON: He eats mice
CONCERN: He was hungry
BECAUSE: Can opener didn't work
CAT SAT ON A HORSE
REASON: He wants to ride
CONCERN: He might fall off
BECAUSE: Saddle is too big
I am trying to write a regular expression that could capture only the 'CAT SAT ON A MOUSE' part, but am having problems capturing the full text.
I have tried:
(\bCAT\sSAT\sON\sA\sMOUSE)(.*)\n{2}
The idea was to match the beginning part of the string and then to capture everything up till two line breaks.
{2} is to capture the two line breaks.
I have tried many more variations but all I manage to do is to capture the first line only.
Any sort of help would be really appreciated.

You were asking for anything then two line breaks.
You needed to ask for a line break followed by anything twice.
Try this one:
(\bCAT\sSAT\sON\sA\sMOUSE)(\n.*){2}

I think your main problem is that your text uses \r\n to separate lines, and you're only looking for \n. Try this:
/^(CAT +SAT +ON +A +MOUSE)(?:(?:\r\n|[\r\n])[^\r\n]+)*/m
(?:\r\n|[\r\n]) matches any of the three most common line separators (which I'll call newlines): \r\n, \r, or \n. It matches exactly one newline at a time, no matter which kind it is. Then [^\r\n]+ takes over, so there can only be one line separator per line. Since paragraphs are delimited by two newlines, the match ends there.
I took the liberty of anchoring the first line with a start anchor (^) in multiline mode (m). It's not absolutely necessary to do that, but helps the regex find a match more quickly, and much importantly, to fail more quickly when no match is possible.
(You haven't said which regex flavor you're working with, so I made a wild guess and used JavaScript syntax.)

What language are you working with? That'll help a bit. In Perl, you can add the m specifier to treat the multi-lined string as a single piece of text:
#! /usr/bin/perl
my $string =<<STRING;
CAT SAT ON A MOUSE
REASON: He eats mice
CONCERN: He was hungry
BECAUSE: Can opener didn't work
This is a test, and not part of the string to match.
STRING
if ($string =~ /(^(CAT[^\n]+).*\n\n/s) {
say "Match: $1";
}
else {
say "Didn't match";
}
In Perl, adding the s on the end treats the enter string as a single line.

This might work:
(\bCAT[^\S\n]SAT[^\S\n]ON[^\S\n]A[^\S\n]MOUSE\b[\s\S]*?)\n{2}
or
(\bCAT[^\S\n]+SAT[^\S\n]+ON[^\S\n]+A[^\S\n]+MOUSE\b[\s\S]*?)\n{2}
Edit - The regex must be slowed after the first anchor, otherwise the next anchor
could be passed up in favor of speed. This can be done with a non-greedy quantifier
or a look-ahead assertion (which allows aggressive behavior at the cost of a check
that basically nullifies its speed).
Edit2 - Sometimes it may be desireable to match an 'apparent' gap between paragraphs that could include non-newline whitespace.
For example \n\n will not match an apparent gap like this:
'start ... \nend of paragraph\n \n' when it should.
In that case, replacing \n{2} with \n[^\S\n]*\n will allow it to match.
Furthermore, since the non-greedy quantifier is used (in this case) \b[\s\S]*?,
it is possible to account for and match the paragraph end when it is at or near the end of file. Putting this all together yeilds:
/(\bCAT\s+SAT\s+ON\s+A\s+MOUSE\b[\s\S]*?)($|\n[^\S\n]*\n)/
which now looks pretty complicated, but does the complete job.

Related

How to multiline regex but stop after first match?

I need to match any string that has certain characteristics, but I think enabling the /m flag is breaking the functionality.
What I know:
The string will start and end with quotation marks.
The string will have the following words. "the", "fox", and "lazy".
The string may have a line break in the middle.
The string will never have an at sign (used in the regex statement)
My problem is, if I have the string twice in a single block of text, it returns once, matching everything between the first quote mark and last quote mark with the required words in-between.
Here is my regex:
/^"the[^#]*fox[^#]*lazy[^#]*"$/gim
And a Regex101 example.
Here is my understanding of the statement. Match where the string starts with "the and there is the word fox and lazy (in that order) somewhere before the string ends with ". Also ignore newlines and case-sensitivity.
The most common answer to limiting is (.*?) But it doesn't work with new lines. And putting [^#?]* doesn't work because it adds the ? to the list of things to ignore.
So how can I keep the "match everything until ___" from skipping until the last instance while still being able to ignore newlines?
This is not a duplicate of anything else I can find because this deals with multi-line matching, and those don't.
In your case, all your quantifiers need to be non-greedy so you can just use the flag ungreedy: U.
/^"the[^#]*fox[^#]*lazy[^#]*"$/gimU
Example on Regex101.
The answer, which was figured out while typing up this question, may seem ridiculously obvious.
Put the ? after the *, not inside the brackets. Parenthesis and Brackets are not analogous, and the ? should be relative to the *.
Corrected regex:
/^"the[^#]*?fox[^#]*?lazy[^#]*?"$/gim
Example from Regex101.
The long and the short of this is:
Non-greedy, multi-line matching can be achieved with [^#]*?
(substituting # for something you don't want to match)

How to search and replace from the last match of a until b?

I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It catches starting from the second line of my example until the end of the quoting environment, I guess the greedy quantifier would catch up to the last \end{quoting} in the file. Is there any possibility of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\#!\_.\)\{-} instead of \_.\{-}; this means that the pattern cannot contain multiple \\}% sequences, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \#! to ensure that the next match for any character, is limited to not match the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\#!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.

Matching line without and with lower-case letters

I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For a example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?
To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets form a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?
Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, I'd assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your inital version:
^(.*?(?=[:lower:])).*?$
The reluctant qualifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so testing with your regex engine may be somewhat different. There are many easily Google-able online regex tests, I just picked on Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
I still don't have a good solution for matching multiple lines.

Regular Expression Using the Dot-Matches-All Mode

Normally the . doesn't match newline unless I specify the engine to do so with the (?s) flag. I tried this regexp on my editor's (UltraEdit v14.10) regexp engine using Perl style regexp mode:
(?s).*i
The search text contains multiple lines and each line contains many 'i' characters.
I expect the above regexp means: search as many characters (because with the '?s' the . now matches anything including newline) as possible (because of the greediness for *) until reaching the character 'i'.
This should mean "from the first character to the last 'i' in the last sentence" (greediness should reach the last sentence, right?).
But with UltraEdit's test, it turns out to be "from the first character to the last 'i' in the first sentence that contains an i". Is this result correct? Did I make any wrong interpretation of my reg expression?
e.g. given this text
aaa
bbb
aiaiaiaiaa
bbbicicid
it is
aaa
bbb
aiaiaiai
matched. But I expect:
aaa
bbb
aiaiaiaiaa
bbbicici
Your regex is correct, and so are your expectations of its performance.
This is a long-known bug in UltraEdit's regex implementation which I have written repeatedly to support about. As far as I know, it still hasn't been fixed. The problem appears to lie in the fact that UE's regex implementation is essentially line-based, and additional lines are taken into the match only if necessary. So .* will match greedily on the current line, but it will not cross a newline boundary if it doesn't have to in order to achieve a match.
There are some other subtle bugs with line endings. For example, lookbehind doesn't work across newlines, either.
Write to IDM support, or change to an editor with decent regex support. I did both.
Yes you are right this looks like a bug.
Your interpretation is correct. If you are in Perl mode and not Posix.
However it should apply to posix as well.
Altough defining the modifiers like you do is very rare.
Mostly you provide a string with delimiters and the modifier afterwards like /.*i/s
But this doesn't matter because your way is correct too. And if it wouldnt be supported, it wouldn't match the first newline either.
So yes, this is definately a bug in your program.
You're right that that regex should match the entire string (all 4 lines). My guess is that UltraEdit is attempting to do some sort of optimization by working line by line, and only accumulating new lines "when necessary".

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of file a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3 digit Uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
since you are on Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo $filename | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
newfile=${file#*x}
newfile=${newfile%.[A-Z][A-Z][A-Z].*}
# or
# newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve your just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, it there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.