Vim RegEx: Match until blank line - regex

I'm trying to write a RegEx that will match any line that contains ".wpd", and then match all lines after that until it reaches a blank line (including the blank line).
This is what I've tried:
/\v^.*.wpd\_.\{-}^\s*$
However, the non-greedy operator \{-} after the "all characters including new lines" character class \_. doesn't seem to work. If I use
/\v^.*.wpd\_.*
that will match the next line containing ".wpd" and then all lines after that. However, as soon as I change the * to \{-}, it doesn't match anything at all.
What am I doing wrong? Thanks!

This one seems to work:
/\v^.*\.wpd\_.{-}\n\s*\n

You cannot use the atom ^ (or $) in the middle of the regexp: ^ has its special meaning only at the start of the pattern and $ only at the end; elsewhere, each is taken as a literal character. Use \n to match a newline inside the regexp, as shown by perreal's answer.

(?s)[^\n\r]*\.wpd(.*?)\n{2}
(?s) - Turn on 'dot matches line breaks' to search across lines
[^\n\r]* - Starting at the beginning of a line, match anything that's not a line break
\.wpd - Match a literal '.wpd'
(.*?) - Match anything, non-greedily, including line breaks ( because we turned on (?s) previously )
\n{2} - ... until you find two newlines in a row, which would be a blank line
:)
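For anyone who wants to try this outside Vim, here is a minimal sketch of the same PCRE-style pattern in Python's re module. The sample text is made up for illustration and is not from the question.

```python
import re

# Same pattern as in the answer above; (?s) lets '.' match line breaks.
PATTERN = re.compile(r"(?s)[^\n\r]*\.wpd(.*?)\n{2}")

# Hypothetical sample text (an assumption, not from the question).
sample = "intro\nreport.wpd first\nmore text\n\nafter blank\n"

match = PATTERN.search(sample)
# The whole match runs from the start of the '.wpd' line through the blank line.
block = match.group(0) if match else None
print(repr(block))
```

Note that \n{2} only recognises a completely empty line; a line holding spaces or tabs would not end the match.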

The following is a long supporting comment on @perreal's answer above, as well as my own version of that answer, which I find more intuitive.
Let's dissect the following regexp based on http://vimdoc.sourceforge.net/htmldoc/pattern.html#/magic
/\v^.*\.wpd\_.{-}\n\s*\n
\v (lowercase v): This is the 'very magic' operator, which signifies that in the pattern after it all ASCII characters except '0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning. Therefore, characters like *, ^ and $ need not be escaped in the pattern, but for _ to have special meaning (such as modifying the behaviour of . to match newline), it needs to be escaped. Hence with \v set, you need \_ for the latter to have special meaning. To truly appreciate how much very magic simplifies the expression, compare it with the same expression using very nomagic (uppercase \V): /\V\^\.\*.wpd\_\.\{-}\n\s\*\n (very nomagic) vs /\v^.*\.wpd\_.{-}\n\s*\n (very magic)
^.*\.wpd: Greedily match anything (.*) from the beginning of a line (^) till .wpd
\_. : Matches a single character, which can be any character including the newline. Note that with \v set, the pattern must have an escaped underscore, as noted above.
{-} : Is the non-greedy equivalent of the * quantifier. So, where .*BLAH matches the most possible characters till BLAH, .{-}BLAH will match the least possible. To see this in action, take a look at this (in this case, I had to use ? instead of {-} since that regex is PCRE).
\n\s*\n: Matches a blank line, which may contain zero or more spaces or tabs
\_.{-}\n\s*\n: combines the above two and means Match the least possible number of characters including newline (\_.) until a blank line (\n\s*\n)
\v^.*\.wpd\_.{-}\n\s*\n: Finally, putting it all together: set the very magic operator (to simplify the pattern by not needing to escape anything except an _ for special meaning), search for any line which contains .wpd, and match until the closest blank line.
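As a rough cross-check of the dissection above, here is how the same idea might look in Python. This is a translation sketch, not Vim syntax: [\s\S]*? stands in for \_.{-}, and [ \t]* replaces \s* because Python's \s also matches newlines. The sample text is hypothetical.

```python
import re

# Rough Python counterpart of the Vim pattern (a translation, not Vim syntax):
#   ^.*\.wpd  -> unchanged, with re.MULTILINE so ^ anchors at every line start
#   \_.{-}    -> [\s\S]*?   (lazy, crosses newlines)
#   \n\s*\n   -> \n[ \t]*\n (Python's \s would also swallow newlines)
WPD_BLOCK = re.compile(r"^.*\.wpd[\s\S]*?\n[ \t]*\n", re.MULTILINE)

# Hypothetical sample with two '.wpd' blocks; the first ends at a
# whitespace-only line, the second at an empty line.
sample = "a.wpd\nline two\n  \nnext paragraph\n\nb.wpd\nlast\n\n"

blocks = WPD_BLOCK.findall(sample)
for b in blocks:
    print(repr(b))
```

Each element of blocks starts at a line containing .wpd and ends with the nearest blank line, which is the behaviour the Vim pattern aims for.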
My version using variants of end-of-line start-of-line characters
The only modification is to the expression used to signify a blank line. I find it useful to define a blank line in terms of the start-of-line ('^') and end-of-line ('$') characters, however as-is, they cannot be used anywhere in a regexp except the beginning and the end respectively.
For the above use-case, there are variants which can be used anywhere in a regex, namely \_^ and \_$ respectively. Therefore the blank line expression can be written as \_^\s*\_$ instead of \n\s*\n, thus making the complete expression:
\v^.*\.wpd\_.{-}\_^\s*\_$
This perhaps is closer to answering the OP's question about why they were unable to use the start-of-line character in their expression.
Phew!

Related

What is the regex to find lines WITHOUT a line break

I'm using SubtitleEdit and I'd like to locate all the lines that do not contain a line break.
A line containing a line break is bilingual, which is what I want.
But those that do not have line breaks are mono-lingual, and I'd like to quickly locate them all and delete them. TIA!
Alternatively, if there is a regex that can find lines which do not contain any English characters, that would also work.
The confusion here was caused by 2 facts:
What SubtitleEdit calls a line is actually a multiline, containing newlines.
The newline displayed is not the one used internally (so it would never match <br>).
Solution 1:
Now that we have found out it uses either \r\n or just \n, we can write a regex:
(?-m)^(?!.*\r?\n)[\s\S]*$
Explanation:
(?-m) - turn off the multiline option (which is otherwise enabled).
^ - match from start of text
(?!.*\r?\n) - negative look ahead for zero or more of any characters followed by newline character(s) - (=Contains)
[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.
In short: If we don't find newline characters, match everything.
Now replace with an empty string.
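Here is a minimal sketch of Solution 1 in Python, in case you want to test the idea outside SubtitleEdit. Python has no global (?-m), but ^ and $ already anchor to the whole text unless re.MULTILINE is passed. The helper name is_monolingual is made up for this example.

```python
import re

# Without re.MULTILINE, ^ and $ anchor to the whole text,
# which plays the role of (?-m) in the pattern above.
NO_LINE_BREAK = re.compile(r"^(?!.*\r?\n)[\s\S]*$")

def is_monolingual(entry: str) -> bool:
    """Hypothetical helper: True when the subtitle entry has no line break."""
    return NO_LINE_BREAK.match(entry) is not None

print(is_monolingual("Hello"))
print(is_monolingual("Hello\r\nBonjour"))
```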
Solution 2:
If you want to match lines that don't have any English characters, you can use this:
(?-m)^(?![\s\S]*[a-zA-Z])[\s\S]*$
Explanation:
(?-m) - turn off the multiline option (which is otherwise enabled).
^ - match from start of text
(?![\s\S]*[a-zA-Z]) - negative look ahead for ANY characters followed by an English character.
[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.
In short: If we don't find an English character, match everything.
Now replace with an empty string.
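Solution 2 can be sketched the same way in Python. Again, the helper name has_no_english is an assumption for illustration; "English character" here means a-z or A-Z only, exactly as in the pattern.

```python
import re

# The lookahead scans the whole entry (newlines included) for any ASCII letter.
NO_ENGLISH = re.compile(r"^(?![\s\S]*[a-zA-Z])[\s\S]*$")

def has_no_english(entry: str) -> bool:
    """Hypothetical helper: True when the entry contains no a-z / A-Z letter."""
    return NO_ENGLISH.match(entry) is not None

print(has_no_english("你好\n世界"))
print(has_no_english("你好\nHello"))
```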
You should use a regex lookahead assertion. Given test lines:
something_1
some<br>thing_2
something_3<br>
<br>something_4
something_5
This is an expression that will match lines 1 and 5
^(?!.*<br>).*$
In this regular expression we have the negative lookahead assertion (?!.*<br>) that allows us to define what line is suitable for us
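A quick way to check this is a small Python sketch over the five test lines. It assumes the literal text <br> is the separator, as in this answer, and uses re.MULTILINE so that ^ and $ work per line.

```python
import re

lines = "\n".join([
    "something_1",
    "some<br>thing_2",
    "something_3<br>",
    "<br>something_4",
    "something_5",
])

# re.MULTILINE makes ^ and $ anchor at each line, so every line is tested alone.
NO_BR = re.compile(r"^(?!.*<br>).*$", re.MULTILINE)

matches = NO_BR.findall(lines)
print(matches)
```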

Regular expression. The regex fails when prohibited characters are on second line

I have the following regex.
^(?!.*&#.*)[\u00e1\u00c1\u00e9\u00c9\u00ed\u00cd\u00f3\u00d3\u00fa\u00da\u00f1\u00d1\u00fc\u00dc\u00ab\u00bb\u00bf\u00a1`\w\d\s\-'.,&#:;®?!()$#/‘’*“”"]+$
The issue is that when "&#" is entered on the second line, the regex does not catch it as a disallowed character sequence.
The regex correctly finds no match when the input is on one line:
The combination of &# is not allowed.
However, if the input is like the one below, i.e. &# is on the second line:
The combination of
&# is not allowed.
it is allowed, although the prohibited characters "&#" are still in the input.
I'm not sure what tweak the regex needs to work when these characters are on the second line.
In your regex, replace (?!.*&#.*) by (?![\s\S]*&#[\s\S]*)
The dot . matches any character except newlines. That is your error.
\s matches all whitespace characters and \S matches all non-whitespace characters. That means that with the character class [\s\S] you can match every single character.
Good luck
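Here is a small Python sketch of the difference. The character class is a simplified stand-in for the long one in the question (an assumption to keep the example readable); only the lookahead differs between the two patterns.

```python
import re

# Simplified stand-in for the question's long character class (an assumption).
ALLOWED = r"[\w\s&#.]+"

dot_version = re.compile(r"^(?!.*&#)" + ALLOWED + r"$")
any_char_version = re.compile(r"^(?![\s\S]*&#)" + ALLOWED + r"$")

one_line = "The combination of &# is not allowed."
two_lines = "The combination of\n&# is not allowed."

# '.' stops at the first newline, so the lookahead never sees '&#' on line two.
print(bool(dot_version.match(one_line)), bool(dot_version.match(two_lines)))
# [\s\S] crosses newlines, so both inputs are rejected.
print(bool(any_char_version.match(one_line)), bool(any_char_version.match(two_lines)))
```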

Understanding regex in shell

I came across single grouping concept in shell script.
cat employee.txt
101,John Doe,CEO
I was practising SED substitute command and came across with below example.
sed 's/\([^,]*\).*/\1/g' employee.txt
It was given that above expression matches the string up to the 1st comma.
I am unable to understand how this matches the 1st comma.
Below is my understanding
s - substitute command
/ delimiter
\ escape character for (
( opening braces for grouping
^ beginning of the line - anchor
[^,] - I am confused by this: is it the negation of a comma, or does it mean something else?
Why are * and then .* used to match the string up to the 1st comma?
^ matches beginning of line outside of a character class []. At the beginning of a character class, it means negation.
So, it says: non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
I know 'link only' answers are to be avoided - Choroba has correctly pointed out that this is:
non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
However I'd like to add that for this sort of thing, I find regulex quite a useful tool for visualising what's going on with a regular expression.
The image representation of your regular expression is:
Given the string "foo, bar" and s/\([^,]*\).*/\1/g, the group \([^,]*\) means "match any character that is not a comma" (zero or more times). Since "f" is not a comma, it matches "f" and "remembers" it. Because it is "zero or more times", it tries again. The next character is not a comma either (it is o), so the regex engine adds that o to the group as well. The same thing happens for the 2nd o.
The next character is indeed a comma, but [^,] forbids it, as @choroba affirmed. What is in the group now is "foo". Then, the regex uses .* outside the group, which causes zero or more characters to be matched but not remembered.
In the replacement part of the regex, \1 is used to place the contents of the remembered text ("foo"). The rest of the matched text is lost and that is how you remain with only the text up to the first comma.
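If it helps, the same substitution can be sketched in Python. Note that Python writes the group with plain parentheses, whereas sed's basic regex needs \( and \). The function name first_field is made up for this example, and count=1 is used here in place of sed's g flag, since one match already covers the whole line.

```python
import re

def first_field(line: str) -> str:
    """Hypothetical helper: keep only the text before the first comma."""
    # ([^,]*) captures the run of non-comma characters; .* consumes the rest.
    # The whole match is then replaced by the captured group alone.
    return re.sub(r"([^,]*).*", r"\1", line, count=1)

print(first_field("101,John Doe,CEO"))
print(first_field("foo, bar"))
```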

How to match words and an empty string

Newbie of regex here! :D
I have to match the string "SOMETHING HERE" in this example:
DATA[SOMETHING HERE]
SOMETHING HERE can be empty (DATA[]) and I have to match that too.
SOMETHING HERE can be anything, carriage returns and line breaks included.
You might be looking for DATA\[(.*)\], where
\[ escapes the [ character, . is any character, and .* means zero or more of any character.
EDIT
I wasn't able to test it and I was sure it would work until I noticed this:
The dot matches a single character, without caring what that character is. The only exception are line break characters. In all regex flavors discussed in this tutorial, the dot does not match line breaks by default.
This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain line breaks, so the dot could never match them.
So . matches almost all characters (excluding CR and LF). You can use this instead:
DATA\[([^a]*[a]*)*\]
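That is: match zero or more characters that are not 'a', then zero or more 'a' characters, repeated. Between them the two classes cover every character, line breaks included (you can use any character in place of 'a').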
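A simpler alternative to the [^a]*[a]* trick, sketched here in Python, is the [\s\S] class, which also matches line breaks. The lazy quantifier is my own assumption: it stops at the first ], which only works if the content itself never contains ]. The helper name extract is hypothetical.

```python
import re

# [\s\S] matches any character, line breaks included; *? stops at the first ']'.
DATA_RE = re.compile(r"DATA\[([\s\S]*?)\]")

def extract(text: str):
    """Hypothetical helper: return the text between DATA[ and ], or None."""
    m = DATA_RE.search(text)
    return m.group(1) if m else None

print(repr(extract("DATA[]")))
print(repr(extract("DATA[line one\r\nline two]")))
```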

Regex to match anything

I know it seems a bit redundant but I'd like a regex to match anything.
At the moment we are using ^*$ but it doesn't seem to match no matter what the text.
I do a manual check for no text but the test view we use is always validated with a regex. However, sometimes we need it to validate anything using a regex. i.e. it doesn't matter what is in the text field, it can be anything.
I don't actually produce the regex and I'm a complete beginner with them.
The regex .* will match anything (including the empty string, as Junuxx points out).
The chosen answer is slightly incorrect, as it won't match line breaks or returns. This regex to match anything is useful if your desired selection includes any line breaks:
[\s\S]+
[\s\S] matches a character that is either a whitespace character (including line break characters) or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character. The + matches one or more of the preceding expression.
^ is the beginning-of-line anchor, so it will be a "zero-width match," meaning it won't match any actual characters (and the first character matched after the ^ will be the first character of the string). Similarly, $ is the end-of-line anchor.
* is a quantifier. It will not by itself match anything; it only indicates how many times a portion of the pattern can be matched. Specifically, it indicates that the previous "atom" (that is, the previous character or the previous parenthesized sub-pattern) can match any number of times.
To actually match some set of characters, you need to use a character class. As RichieHindle pointed out, the character class you need here is ., which represents any character except newlines (and it can be made to match newlines as well using the appropriate flag). So .* represents * (any number) matches on . (any character). Similarly, .+ represents + (at least one) matches on . (any character).
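To see the difference between these variants, here is a small Python sketch. It uses re.fullmatch so that the whole text must be covered by the pattern; re.DOTALL is the "appropriate flag" in Python that lets . match newlines. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# '.' does not match '\n' by default, so .* cannot cover two lines.
dot = re.fullmatch(r".*", text)
# [\s\S] matches any character, newlines included.
any_char = re.fullmatch(r"[\s\S]*", text)
# With re.DOTALL, '.' matches newlines as well.
dotall = re.fullmatch(r".*", text, flags=re.DOTALL)
# '*' accepts the empty string, '+' does not.
empty_star = re.fullmatch(r".*", "")
empty_plus = re.fullmatch(r"[\s\S]+", "")

print(dot, any_char is not None, dotall is not None)
print(empty_star is not None, empty_plus)
```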
I know this is a fairly old post, but there are other ways to write it, such as:
.*
(.*?)