Match multiline PCRE until an exception - regex

Is it possible to use a regex to generate matches until a pattern is broken?
https://regex101.com/r/bRQkWM/1
(?m)(?=.*?\*)(\d+)|\*\w*.*$
In this instance, capture the digits at the start of the line, plus the rest of the line provided the line begins with a *.
If the line does not begin with a *, do not match digits or rest of line.
Thank you in advance!

The solution should be:
(?m)\G(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)
However, the example you provided breaks the pattern on its very first line, since that line contains no *. That leads me to conclude that you want to ignore all lines before the first match. If that is the desired behaviour, then the solution should be:
(?m)\A(?:\d+\s+[^*]\w*.*$[\n\r]*)*|\G(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)
This extended regex pattern will work even if there is no broken pattern before the first match.
Please keep in mind that the first match returned by this pattern has to be discarded: it contains the skipped lines that precede the first real match, or it is empty if there are no lines to skip.
The key to the above solution(s) is the use of \G, the anchor that matches at the position where the previous match ended.
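For anyone whose engine lacks \G, here is a minimal sketch of the same "stop at the first broken line" idea in Python, whose built-in re module does not support \G. Calling match() at the offset where the previous match ended plays the role of the anchor. The sample text and the helper name contiguous_matches are made up for illustration; the question's real input lives only on regex101.

```python
import re

# The answer's pattern without the leading \G (not supported by Python's re).
LINE = re.compile(r'(\d+)\s+\*(\w*.*)(?:[\n\r]+|$)', re.MULTILINE)


def contiguous_matches(text, start=0):
    """Collect (digits, rest) pairs until the first line that breaks the pattern."""
    results = []
    pos = start
    while pos < len(text):
        # match() only succeeds exactly at pos, which is what \G would enforce.
        m = LINE.match(text, pos)
        if m is None:
            break  # pattern broken: stop, even if later lines would match
        results.append((m.group(1), m.group(2)))
        pos = m.end()
    return results


# Hypothetical input: the third line has no *, so matching stops there.
sample = "1 *alpha\n2 *beta\n3 gamma\n4 *delta\n"
print(contiguous_matches(sample))  # [('1', 'alpha'), ('2', 'beta')]
```

Note that the line "4 *delta" is never reported, because the chain of adjacent matches was already broken by the line before it.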

Related

Language Syntax Highlight - Comment Line Starts With * may or may not have following words

I am creating a syntax highlight file for a language and I have everything mapped out and working with one exception.
I cannot come up with a regex that will match the following conditions for a specific line comment style.
If the first non white-space character is an asterisk (*) the line is considered a comment.
I have created many samples that work in regexr but it never captures in vscode.
For example, regexr is cool with this:
^(?:\s*)\*+(?:.*)?\n
So I convert it into the proper format for the tmlanguage.json file:
^(?:\\s*)\\*+(?:.*)?\\n
But it is not capturing properly: if the first character of the line is an *, it is not caught, but if the first character is a whitespace character followed by an *, it does work.
I suck at formatting on Stack Overflow, so <tab> represents a chr(9) tab character and a leading blank is a space.
*******************************
*****************************
<tab>*************************
* comment
* comment
<tab>* comment
But it shouldn't work in these cases:
string *******************************
string ***************************** string
<tab>string *************************
x *= 3
I am guessing that either the anchor ^ isn't working in my regex or I am escaping something incorrectly.
Any advice?
Please see sample image attached: screenshot
I don't know which regex engine you're using, so I'm just going to give you some general tips on how it should be done.
First off, if you're reading a string with more than one newline in it, the anchor ^ in an engine's default state means Beginning of String (BOS). What you want in this case is multi-line mode, which makes the anchor ^ match at the Beginning of Line (BOL) as well as at the BOS.
Second, you don't need those non capture groups (?:\s*) (?:.*), they encapsulate single constructs.
Third, it is redundant to make a group optional when its enclosed contents are optional (?:.*)?
Fourth, you don't need the newline \n construct at the end, since it should not be highlighted anyway, and it might not be present on the last line of text. In that case the regex would fail to match the last line.
So, putting it all together, the modified regex would be (?m)^\s*\*.*
Explained
(?m) # Inline modifier: Multi-line mode
^ # Beginning of line
\s* # Zero or more whitespace characters
\* # A required literal asterisk
.* # Optional rest of the line (any non-newline characters)
Note that you could put a single capture group around the data if you need to reference it in a replace: (?m)^(\s*\*.*)
Also, the language you're using should have a way to specify options when compiling the regex. If the engine doesn't accept inline modifiers (?m) take it out and specify that option when compiling the regex.
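As a quick check outside the editor, here is a small sketch that runs the modified regex over lines like the ones in the question. It uses Python's re module, with the multi-line option passed as a compile flag instead of the inline (?m); the sample lines are shortened versions of the question's examples.

```python
import re

# Multi-line mode is given as a flag here rather than as inline (?m).
COMMENT = re.compile(r'^\s*\*.*', re.MULTILINE)

sample = "\n".join([
    "*****",            # comment
    "\t*****",          # comment (leading tab)
    "* comment",        # comment
    "\t* comment",      # comment (leading tab)
    "string *****",     # not a comment
    "\tstring *****",   # not a comment
    "x *= 3",           # not a comment
])

comment_lines = COMMENT.findall(sample)
print(comment_lines)  # ['*****', '\t*****', '* comment', '\t* comment']
```

One caveat with this sketch: \s also matches newlines, so a blank line directly before a comment would be pulled into the match. Using [ \t]* instead of \s* avoids that if it matters.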
Apparently VS Code's syntax highlighter is single-line. No matter how much I tried matching regexes that span several lines, these never worked.
Second, if you're designing a language, I suggest you not use an arithmetic operator for comments.
Third, apparently you can match newlines in the begin and end attributes. You can try it there.

Notepad++ Regex - Issue with ^ anchor and repeating patterns

When one tries to remove some characters from the start of a line and the anchored pattern can be found again after the first replace, it will be removed again.
For a very simple example, given the input 012345, the search pattern ^. and an empty replacement, Notepad++ will remove the whole line when using Replace All. This is most likely because the cursor is still at the start of the line after the first replacement and thus matches the ^ anchor again.
How can one ensure that only the actual first character is removed (in my case the expected output would be 12345)?
You can see my workaround in my answer, but maybe there is another nice trick to achieve it.
One can match the rest of the line, capture the match into a group and then use this group as replacement. The pattern in the question could be adjusted to ^.(.*) and be replaced by $1.
This will force the cursor to move forward in the string, so the ^ anchor can't match again.
Another workaround could be finding:
^.(.)?
and replacing it with:
\1
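To see what the two workaround patterns are meant to produce, here is a small sketch using Python's re.sub. Python does not re-match ^ after a replacement, so this does not reproduce the Notepad++ behaviour; it only illustrates the expected result of each search/replace pair. Python writes the backreference as \1 rather than $1.

```python
import re

line = "012345"

# Workaround 1: consume the rest of the line and put it back.
first = re.sub(r'^.(.*)', r'\1', line, flags=re.MULTILINE)

# Workaround 2: consume one extra character (if any) and put it back.
second = re.sub(r'^.(.)?', r'\1', line, flags=re.MULTILINE)

print(first)   # 12345
print(second)  # 12345

# Both also work line by line on multi-line input.
both_lines = re.sub(r'^.(.*)', r'\1', "012345\nabcdef", flags=re.MULTILINE)
print(both_lines.splitlines())  # ['12345', 'bcdef']
```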
I'm sure this is the subject of a bug report, but I couldn't find it as of now. In N++:
Anchors are buggy.
With Replace All, replacements are not supposed to be subject to re-matching. But they are when the replacement string is invisible or zero-length.
Take care of them.

Regex negation in vim

In vim I would like to use a regex to highlight each line that ends with a letter and contains neither // nor : before it. I tried the following:
syn match systemverilogNoSemi "\(.*\(//\|:\).*\)\@!\&.*[a-zA-Z0-9_]$" oneline
This worked very well on comments, but did not work on lines containing a colon.
Any idea why?
Because with this regex Vim can choose any point at which to start the match. Obviously it chooses a point where the first concat matches (i.e. a point after which there is no // or :). These things are normally done by using either
\v^%(%(\/\/|\:)@!.)*\w$
(removed the first concat and the branch itself; changed .* to %(%(\/\/|\:)@!.)*; replaced the collection with the equivalent \w; added an anchor for the start of the line) if you need to match the whole line, or a negative look-behind if you need to match only the last character. You can also just add an anchor to the first concat of your variant (you should remove the trailing .* from the first concat, as it is useless, and the branch symbol for the same reason).
Note: I have no idea why your regex worked for comments. It does not work with comments the way you need it in all cases I checked.
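The same "no character on this line may start // or be :" idea can be tried outside Vim. Below is a rough Python translation of the anchored pattern, offered as a sketch rather than as Vim syntax: (?!...) stands in for Vim's negative look-ahead, and the sample lines are invented for illustration.

```python
import re

# Each character consumed by the group must not begin "//" and must not be ":".
NO_SEMI = re.compile(r'^(?:(?!//|:).)*\w$')

lines = [
    "assign a = b",        # ends with a word character, no // or :
    "assign a = b;",       # ends with ;
    "// just a comment",   # starts with //
    "foo = bar // note",   # contains //
    "case x: foo",         # contains :
]

flagged = [ln for ln in lines if NO_SEMI.match(ln)]
print(flagged)  # ['assign a = b']
```

Because the pattern is anchored at the start of the line, the engine can no longer pick a later starting point that happens to lie after the // or :.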
Does this work for you?
^\(\(//\|:\)\@<!.\)*[a-zA-Z0-9_]$

RegExp adaption with new line

I have the following RegExp to find the URIs listed below:
"^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$"
URLs to find:
1. www.example.org
2. www.example-example.org
3. www.example-example.org/product
4. You'll find it at www.example-
   example.org/product.
5. www.example.org
   You'll find it there.
Numbers 1, 2 and 3 are found, but number 4 yields "www.example-" as the URI.
Without the period at the end of number 4, it is matched correctly.
EDIT: After removing ^ and $, only number 5 is not working.
Can anyone help here?
Your pattern
^w{3}\.[\S\-\n|\S]+[^\s.!?,():]+$
can be simplified to
^w{3}\.[\S\n]+[^\s.!?,():]$
[\S\-\n|\S] is a character class: no alternation is possible inside it (| is a literal there), nothing needs to be repeated, and - is already included in \S. So [\S\n] does the same thing.
[^\s.!?,():]+ does not need the +, because the preceding expression already matches every non-whitespace character. I assume you just want your pattern not to end with one of the characters from the class.
See your pattern on Regexr (I added \r to your first class, because the line breaks there need it).
This is a very useful tool to test regexes
I think your problem is that you want to allow line breaks in the link. How do you want to handle this? When a line ends with a link, how do you want to tell whether the word on the next line is just a word or part of the link? I think this is not possible!
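The ambiguity is easy to reproduce. Here is a short sketch in Python (an assumption, since the question does not say which language is used) that applies the simplified pattern without the ^ and $ anchors to examples 4 and 5 from the question:

```python
import re

# Simplified pattern from above, with the ^ and $ anchors removed.
URL = re.compile(r'w{3}\.[\S\n]+[^\s.!?,():]')

wrapped = "You'll find it at www.example-\nexample.org/product."
followed = "www.example.org\nYou'll find it there."

wrapped_hit = URL.search(wrapped).group()
followed_hit = URL.search(followed).group()

print(repr(wrapped_hit))   # 'www.example-\nexample.org/product'
print(repr(followed_hit))  # "www.example.org\nYou'll"
```

Allowing \n inside the class is what lets example 4 match across the line break, and it is also exactly what makes example 5 swallow the first word of the next line. The pattern alone cannot tell the two situations apart.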
The problem is the '^\s' in the second square-bracketed part. Depending on your programming language, '\s' might match the newline. So you are telling it to match anything that is not whitespace, and then it finds whitespace (the newline).
However, this should only be one of your issues. Your regex uses the '^' and '$' characters which mean start and end of line respectively. Try this URL example:
hello from www.example.org
Did it match? I think it will not.
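A minimal check of that claim, again sketched in Python as an assumed language. It compares the simplified pattern with and without the anchors on the sentence above; the multi-line flag is an assumption about how the original pattern is applied, and the result is the same without it.

```python
import re

ANCHORED = re.compile(r'^w{3}\.[\S\n]+[^\s.!?,():]$', re.MULTILINE)
UNANCHORED = re.compile(r'w{3}\.[\S\n]+[^\s.!?,():]')

text = "hello from www.example.org"

anchored_hit = ANCHORED.search(text)
unanchored_hit = UNANCHORED.search(text)

print(anchored_hit)            # None: the line does not start with "www"
print(unanchored_hit.group())  # www.example.org
```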

Regex Searching in vim

I'm using vim to do some pattern matching on a text file. I've enabled search highlighting so that I know exactly what is getting matched on each search and am getting confused.
Consider searching for [a-z]* on the following text:
123456789abcdefghijklmnopqrstuvwxyxz987654321ABCDEFGHIJKLMNOPQRSTUVWQXZ
I expected this search to match zero or more consecutive characters that are in the range [a-z]. Instead, I get a match on the entire line.
Should this be the expected behaviour?
Thanks,
Andrew
It's matching the empty strings that occur after every character. It has no way of highlighting empty ranges, so it looks like everything is highlighted.
Try searching for [a-z]\+ instead.
The empty string matches [a-z]*, therefore the pattern matches everywhere. Perhaps you want to cut down some of the cases by using [a-z]+ (1 or more) or [a-z]{4,} (4 or more); in Vim's default syntax these are written [a-z]\+ and [a-z]\{4,}.
You're not getting a match on the entire line, you're getting a match at every character. Your pattern also matches the empty string, and an empty match is possible at every single position.
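The empty matches are easier to see in an engine that lists them instead of highlighting them. A small sketch with Python's re module (not Vim, so the quantifier is written + rather than \+), on a shortened, made-up version of the text:

```python
import re

text = "12ab34"

star_hits = re.findall(r'[a-z]*', text)   # zero or more: empty matches allowed
plus_hits = re.findall(r'[a-z]+', text)   # one or more: no empty matches

print(star_hits)  # contains 'ab' plus an empty string for each non-letter position
print(plus_hits)  # ['ab']
```

With *, the run of letters is still found, but so is an empty string at every position where no letter starts. With +, only the real run of letters remains.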