Regular expression: the regex fails when prohibited characters are on the second line

I have the following regex.
^(?!.*&#.*)[\u00e1\u00c1\u00e9\u00c9\u00ed\u00cd\u00f3\u00d3\u00fa\u00da\u00f1\u00d1\u00fc\u00dc\u00ab\u00bb\u00bf\u00a1`\w\d\s\-'.,&#:;®?!()$#/‘’*“”"]+$
The issue is that when the prohibited text is on the second line of the input, the regex does not catch "&#" as a disallowed sequence.
As expected, the regex does not find a match when the input is on one line:
The combination of &# is not allowed.
However, if the input is like below, i.e. &# is on the second line:
The combination of
&# is not allowed.
It is allowed, although the prohibited characters "&#" are still in the input.
I'm not sure what tweak the regex needs in order to work when these characters are on the second line.

In your regex, replace (?!.*&#.*) by (?![\s\S]*&#[\s\S]*)
The dot . matches any character except newlines. That is your error.
\s matches all whitespace characters and \S matches all non-whitespace characters. That means that the character class [\s\S] matches every single character, newlines included.
Good luck
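A minimal Python sketch of the difference, assuming Python's re module behaves like the asker's engine on this point (by default its dot also stops at a newline). ALLOWED is a shortened stand-in for the original character class, and is_valid is a helper name invented for this illustration.

```python
import re

# Shortened stand-in for the original character class (an assumption for brevity).
ALLOWED = r"[\w\s&#.]+"

# Original lookahead: .* stops at the first newline, so "&#" on line 2 is never seen.
BROKEN = re.compile(r"^(?!.*&#)" + ALLOWED + r"$")
# Fixed lookahead: [\s\S]* also crosses newlines.
FIXED = re.compile(r"^(?![\s\S]*&#)" + ALLOWED + r"$")


def is_valid(pattern, text):
    """Return True when the whole text passes the validation pattern."""
    return pattern.search(text) is not None


one_line = "The combination of &# is not allowed."
two_lines = "The combination of\n&# is not allowed."

print(is_valid(BROKEN, one_line))   # False: rejected as intended
print(is_valid(BROKEN, two_lines))  # True: wrongly accepted
print(is_valid(FIXED, two_lines))   # False: rejected after the fix
```

Only the lookahead changes; the character class after it is untouched, so text without "&#" is still accepted on any number of lines.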

Related

What is the regex to find lines WITHOUT a line break

I'm using SubtitleEdit and I'd like to locate all the lines that do not contain a line break.
A line break indicates that the line is bilingual, which is what I want.
But those that do not have line breaks are mono-lingual, and I'd like to quickly locate them all and delete them. TIA!
Alternatively, if there is a regex expression that can find lines which do not contain any English characters, that would also work.
The confusion here was caused by 2 facts:
What SubtitleEdit calls a line is actually a multiline text, containing newlines.
The newline displayed is not the one used internally (so it would never match <br>).
Solution 1:
Now that we have found out it uses either \r\n or just \n, we can write a regex:
(?-m)^(?!.*\r?\n)[\s\S]*$
Explanation:
(?-m) - turn off the multiline option (which is otherwise enabled).
^ - match from start of text
(?!.*\r?\n) - negative look ahead for zero or more of any characters followed by newline character(s) - (=Contains)
[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.
In short: If we don't find newline characters, match everything.
Now replace with an empty string.
Solution 2:
If you want to match lines that don't have any English characters, you can use this:
(?-m)^(?![\s\S]*[a-zA-Z])[\s\S]*$
Explanation:
(?-m) - turn off the multiline option (which is otherwise enabled).
^ - match from start of text
(?![\s\S]*[a-zA-Z]) - negative look ahead for ANY characters followed by an English character.
[\s\S]*$ - match zero or more of ANY character (including newline) - will match the rest of text.
In short: If we don't find an English character, match everything.
Now replace with an empty string.
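Both solutions can be tried outside SubtitleEdit with a rough Python sketch. This is an assumption-laden translation: the (?-m) prefix is dropped because multiline mode is already off by default in Python's re module, and the sample entries are invented for illustration.

```python
import re

# Multiline mode is off by default in Python, so ^ only matches at the very
# start of each entry, which is what (?-m) asks for in the original patterns.
NO_LINE_BREAK = re.compile(r"^(?!.*\r?\n)[\s\S]*$")
NO_ENGLISH = re.compile(r"^(?![\s\S]*[a-zA-Z])[\s\S]*$")

# Hypothetical subtitle entries: some bilingual, some monolingual.
entries = ["Hello\r\nBonjour", "Bonjour", "Hello\nПривет", "Привет"]

single_line = [e for e in entries if NO_LINE_BREAK.search(e)]
no_english = [e for e in entries if NO_ENGLISH.search(e)]

print(single_line)  # ['Bonjour', 'Привет']
print(no_english)   # ['Привет']
```

Note that "Bonjour" counts as containing English characters for Solution 2, because [a-zA-Z] tests for Latin letters rather than for the English language.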
You should use a lookahead assertion. Given these test lines:
something_1
some<br>thing_2
something_3<br>
<br>something_4
something_5
This expression will match lines 1 and 5:
^(?!.*<br>).*$
In this regular expression, the negative lookahead assertion (?!.*<br>) defines which lines are suitable: only those that do not contain <br> anywhere.
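The same test can be reproduced with a short Python sketch. It assumes the five test lines are joined into one text and that re.MULTILINE is set, so that ^ and $ anchor at every line.

```python
import re

lines = "\n".join([
    "something_1",
    "some<br>thing_2",
    "something_3<br>",
    "<br>something_4",
    "something_5",
])

# re.MULTILINE makes ^ and $ anchor at every line, not just the whole text.
without_br = re.findall(r"^(?!.*<br>).*$", lines, flags=re.MULTILINE)
print(without_br)  # ['something_1', 'something_5']
```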

Regular expression couldn't match text when there is line break within it

I am trying to extract strings based on a regular expression; however, when a new line exists within the string, the regular expression doesn't handle it.
Regular Expression -
^Test\s[0-9]-[0-9]:.+?(?=\.)
The expression is simple: it matches any string that starts with Test, followed by a space, digit-digit and a colon, followed by any text up to the next period.
This finds the text messages like below
Test 1-8: This is first test.
Test 9-8: This is second test and is OK.
Test 5-1:This is Test 1,3 three.
However, when the text contains a line break, as below, the above regular expression doesn't work.
Test 9-8: This is second test
and is OK.
How should I handle this in my regular expression?
The . (used in .+?(?=\.)) does not match line break chars in non-POSIX regex engines (exact chars vary across regex libraries).
Use a negated character class [^.]+ here:
^Test\s[0-9]-[0-9]:[^.]+
The [^.]+ matches any 1 or more characters (including line breaks) other than a literal dot.
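A small Python sketch of the negated character class at work on the sample messages. The re.MULTILINE flag is an assumption added here so that ^ matches at the start of every message in one text.

```python
import re

text = (
    "Test 1-8: This is first test.\n"
    "Test 9-8: This is second test\n"
    "and is OK.\n"
    "Test 5-1:This is Test 1,3 three.\n"
)

# [^.] is a negated class, so it matches line breaks as well as ordinary text.
pattern = re.compile(r"^Test\s[0-9]-[0-9]:[^.]+", re.MULTILINE)
messages = pattern.findall(text)

for message in messages:
    print(repr(message))
# 'Test 1-8: This is first test'
# 'Test 9-8: This is second test\nand is OK'
# 'Test 5-1:This is Test 1,3 three'
```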
To match "any character, including newlines" you can use the following: [\s\S], which means "any whitespace character and any non-whitespace character"... so effectively: everything.
Alternatively you could use the 's' flag: /^Test\s[0-9]-[0-9]:.+?(?=\.)/s. This will also include newlines for the dot.
The solution provided by #wiktor-stribiżew is more efficient though, so I would advise using that one.
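For comparison, here is how the two alternatives from this answer look in Python, where the 's' flag is spelled re.DOTALL. This is a sketch on the question's two-line sample only.

```python
import re

text = "Test 9-8: This is second test\nand is OK."

# Default dot: stops at the newline, so no period is ever reached.
default_dot = re.search(r"^Test\s[0-9]-[0-9]:.+?(?=\.)", text)
# Same pattern with the "dot matches newline" flag.
dotall_dot = re.search(r"^Test\s[0-9]-[0-9]:.+?(?=\.)", text, flags=re.DOTALL)
# [\s\S] in place of the dot, no flag needed.
any_char_class = re.search(r"^Test\s[0-9]-[0-9]:[\s\S]+?(?=\.)", text)

print(default_dot)                   # None
print(repr(dotall_dot.group()))      # 'Test 9-8: This is second test\nand is OK'
print(repr(any_char_class.group()))  # 'Test 9-8: This is second test\nand is OK'
```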

How to match words and an empty string

Newbie of regex here! :D
I have to match the string "SOMETHING HERE" in this example:
DATA[SOMETHING HERE]
SOMETHING HERE can be empty (DATA[]) and I have to match that too.
SOMETHING HERE can be anything, carriage returns and line breaks included.
You might be looking for DATA\[(.*)\], where \[ escapes the [ character, . is any character and .* means zero or more of any character.
EDIT
I wasn't able to test it, and I was sure it would work until I noticed this:
The dot matches a single character, without caring what that character is. The only exception are line break characters. In all regex flavors discussed in this tutorial, the dot does not match line breaks by default.
This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain line breaks, so the dot could never match them.
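So . matches almost all characters (excluding CR and LF). You can use this instead:
DATA\[([^a]*[a]*)*\]
That is: match any run of characters that are not 'a', then any run of 'a', repeated. Together the two classes cover every character, line breaks included (you can use any character in place of 'a').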
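As a hedged alternative to the pattern above, the sketch below uses [\s\S]*? instead. It is a swapped-in technique rather than the answer's own pattern, it is lazy so that the match stops at the first closing bracket, and extract is a helper name invented for this illustration.

```python
import re

# [\s\S] matches any character, line breaks included; *? keeps the match short.
DATA_RE = re.compile(r"DATA\[([\s\S]*?)\]")


def extract(text):
    """Return the text between DATA[ and the next ], or None if absent."""
    match = DATA_RE.search(text)
    return match.group(1) if match else None


print(repr(extract("DATA[]")))                   # ''
print(repr(extract("DATA[SOMETHING\r\nHERE]")))  # 'SOMETHING\r\nHERE'

# The first attempt with a plain dot fails once a newline is inside the brackets.
print(re.search(r"DATA\[(.*)\]", "DATA[SOMETHING\nHERE]"))  # None
```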

Regex to match anything

I know it seems a bit redundant but I'd like a regex to match anything.
At the moment we are using ^*$ but it doesn't seem to match no matter what the text.
I do a manual check for no text but the test view we use is always validated with a regex. However, sometimes we need it to validate anything using a regex. i.e. it doesn't matter what is in the text field, it can be anything.
I don't actually produce the regex and I'm a complete beginner with them.
The regex .* will match anything (including the empty string, as Junuxx points out).
The chosen answer is slightly incorrect, as it won't match line breaks or carriage returns. This regex to match anything is useful if your desired selection includes any line breaks:
[\s\S]+
[\s\S] matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character. The + matches one or more of the preceding expression.
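A quick Python check of the two candidates, using re.fullmatch so that the whole text has to match. This is a sketch of Python's behaviour, and other engines may treat the dot slightly differently.

```python
import re

multi_line = "first line\nsecond line"

# .* cannot cross the newline, so it does not cover the whole text.
print(re.fullmatch(r".*", multi_line))                               # None
# With the DOTALL flag the dot matches newlines too.
print(re.fullmatch(r".*", multi_line, flags=re.DOTALL) is not None)  # True
# [\s\S]+ needs no flag.
print(re.fullmatch(r"[\s\S]+", multi_line) is not None)              # True

# The quantifier decides whether the empty string is accepted.
print(re.fullmatch(r".*", "") is not None)  # True
print(re.fullmatch(r"[\s\S]+", ""))         # None
```

If empty input must also validate, [\s\S]* is the variant to use.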
^ is the beginning-of-line anchor, so it will be a "zero-width match," meaning it won't match any actual characters (and the first character matched after the ^ will be the first character of the string). Similarly, $ is the end-of-line anchor.
* is a quantifier. It will not by itself match anything; it only indicates how many times a portion of the pattern can be matched. Specifically, it indicates that the previous "atom" (that is, the previous character or the previous parenthesized sub-pattern) can match any number of times.
To actually match some set of characters, you need to use a character class. As RichieHindle pointed out, the character class you need here is ., which represents any character except newlines (and it can be made to match newlines as well using the appropriate flag). So .* represents * (any number) matches on . (any character). Similarly, .+ represents + (at least one) matches on . (any character).
I know this is a bit of an old post, but there are other ways to write it, such as:
.*
(.*?)

Vim RegEx: Match until blank line

I'm trying to write a RegEx that will match any line that contains ".wpd", and then match all lines after that until it reaches a blank line (including the blank line).
This is what I've tried:
/\v^.*.wpd\_.\{-}^\s*$
However, the non-greedy operator \{-} after the "all characters including new lines" character class \_. doesn't seem to work. If I use
/\v^.*.wpd\_.*
that will match the next line containing ".wpd" and then all lines after that. However, as soon as I change the * to \{-}, it doesn't match anything at all.
What am I doing wrong? Thanks!
This one seems to work:
/\v^.*\.wpd\_.{-}\n\s*\n
You cannot use the atom ^ (same for $) inside the regexp; it has its special meaning only at the front (back). Elsewhere, it's taken as the literal character. Use \n to match a newline inside the regexp, as shown by perreal's answer.
(?s)[^\n\r]*\.wpd(.*?)\n{2}
(?s) - Turn on 'dot matches line breaks' to search across lines
[^\n\r]* - Starting at the beginning of a line, match anything that's not a line break
\.wpd - Match a literal '.wpd'
(.*?) - Match anything, non-greedily, including line breaks ( because we turned on (?s) previously )
\n{2} - ... until you find two newlines in a row, which would be a blank line
:)
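This pattern is not Vim syntax, but it can be tried as-is in Python, whose re module accepts the inline (?s) flag. A minimal sketch, assuming Unix \n line endings and an invented sample text:

```python
import re

text = "intro\nfile.wpd header\nline a\nline b\n\nafter\n"

match = re.search(r"(?s)[^\n\r]*\.wpd(.*?)\n{2}", text)

# Whole match: the .wpd line through the blank line.
print(repr(match.group(0)))  # 'file.wpd header\nline a\nline b\n\n'
# Captured group: everything between '.wpd' and the blank line.
print(repr(match.group(1)))  # ' header\nline a\nline b'
```

Because \n{2} requires two directly adjacent newlines, a "blank" line that contains spaces or tabs would not end the match here.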
The following is a large supporting comment to #perreal's answer above as well as my own version of that answer which I find more intuitive.
Let's dissect the following regexp based on http://vimdoc.sourceforge.net/htmldoc/pattern.html#/magic
/\v^.*\.wpd\_.{-}\n\s*\n
\v (lowercase v): This is the 'very magic' operator, which signifies that in the pattern after it all ASCII characters except '0'-'9', 'a'-'z', 'A'-'Z' and '_' have a special meaning. Therefore, characters like *, ^ and $ need not be escaped in the pattern, but for _ to have special meaning (such as modifying the behaviour of . to match newlines), it needs to be escaped. Hence with \v set, you still need \_ for the latter to have special meaning. To truly appreciate how much very magic simplifies the expression, compare it with the same expression using very nomagic (uppercase \V): /\V\^\.\*.wpd\_\.\{-}\n\s\*\n (very nomagic) vs /\v^.*\.wpd\_.{-}\n\s*\n (very magic)
^.*\.wpd: Greedily match anything (.*) from the beginning of a line (^) till .wpd
\_. : Matches a single character, which can be any character including the newline. Note that with \v set, the pattern must have an escaped underscore, as noted above.
{-} : Is the non-greedy equivalent of the * quantifier. So, where .*BLAH matches the most possible characters till BLAH, .{-}BLAH will match the fewest possible. (In PCRE-style flavours the same lazy quantifier is written *? rather than {-}.)
\n\s*\n: Matches a blank line, which may contain zero or more spaces or tabs
\_.{-}\n\s*\n: combines the above two and means Match the least possible number of characters including newline (\_.) until a blank line (\n\s*\n)
\v^.*\.wpd\_.{-}\n\s*\n: Finally, putting it all together: set the very magic operator (to simplify the pattern by not needing to escape anything except an _ for special meaning), search for any line which contains .wpd and match until the closest blank line.
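The greedy versus non-greedy difference described above can be seen in a rough Python translation of the Vim pattern. The translation is an assumption, not Vim syntax: [\s\S] stands in for \_., *? for {-}, and [ \t]* for \s*, because Python's \s would also match newlines.

```python
import re

text = "a.wpd\nx\n\ny\n\nz\n"

# Rough translation of /\v^.*\.wpd\_.{-}\n\s*\n:
#   \_.   -> [\s\S]   (any character, newline included)
#   {-}   -> *?       (lazy quantifier)
#   \s*   -> [ \t]*   (spaces and tabs only)
lazy = re.search(r"^.*\.wpd[\s\S]*?\n[ \t]*\n", text, flags=re.MULTILINE)
greedy = re.search(r"^.*\.wpd[\s\S]*\n[ \t]*\n", text, flags=re.MULTILINE)

print(repr(lazy.group()))    # 'a.wpd\nx\n\n'        stops at the first blank line
print(repr(greedy.group()))  # 'a.wpd\nx\n\ny\n\n'   runs on to the last blank line
```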
My version using variants of end-of-line start-of-line characters
The only modification is to the expression used to signify a blank line. I find it useful to define a blank line in terms of the start-of-line ('^') and end-of-line ('$') characters, however as-is, they cannot be used anywhere in a regexp except the beginning and the end respectively.
For the above use-case, there are variants which can be used anywhere in a regex, namely \_^ and \_$ respectively. Therefore the blank line expression can be written as \_^\s*\_$ instead of \n\s*\n, thus making the complete expression:
\v^.*\.wpd\_.{-}\_^\s*\_$
This perhaps is closer to answering the OP's question about why they were unable to use the start-of-line character in their expression.
Phew!