regex matching pair of brackets - regex

I'm trying to write a Sublime Text 2 syntax highlighter for Simulink's Target Language Compiler (TLC) files. This is a scripting language for auto-generating code. In TLC, the syntax to expand the contents of a token (similar to dereferencing a pointer in C or C++) is
%<token>
The regular expression I wrote to match this is
%<.+?>
This works for most cases, but fails for the following statement
%<LibAddToCommonIncludes("<string.h>")>
Making the quantifier greedy fixes this when the statement is on a line by itself, but it fails in several other cases, so that is not an option.
For that line, the highlighting stops at the first > instead of the second. How can I modify the regular expression to handle this case?
It'd be great if there was a general expression that could handle any number of nested <> pairs; for example
%<...<...>...<...<...>...>...>
where the dots are optional characters. The entire expression above should be a single match.

A generic solution using regular expressions is difficult, as explained very well in this thread.
You can try to match exactly two levels of < characters with a regex, something like %<.+?<.+?>.+?>. Note that this only matches expansions that actually contain a nested <...> pair.
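For arbitrary nesting depth, the usual way out is to count brackets instead of relying on a single regex. The following is a minimal Python sketch of that idea, not something that can be pasted into a Sublime Text syntax definition; the function name find_tlc_expansions is hypothetical, and the sketch assumes every < inside an expansion has a matching >.

```python
def find_tlc_expansions(text):
    """Return every %<...> expansion in text, honouring nested <> pairs."""
    spans = []
    i = 0
    while True:
        start = text.find("%<", i)
        if start == -1:
            break
        depth = 0
        end = None
        # Start counting at the '<' that follows the '%'.
        for j in range(start + 1, len(text)):
            if text[j] == "<":
                depth += 1
            elif text[j] == ">":
                depth -= 1
                if depth == 0:
                    end = j + 1
                    break
        if end is None:
            break  # unbalanced brackets: stop rather than guess
        spans.append(text[start:end])
        i = end
    return spans


print(find_tlc_expansions('%<LibAddToCommonIncludes("<string.h>")>'))
```

The depth counter is what a plain regular expression cannot express; regex flavours with recursion support can do the same thing inside the pattern, but whether the highlighter's engine offers that would need to be checked.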

Related

Why does JFlex reject .+?(?=->)

I'm trying to design an IDEA plugin for a simple language.
I want to split the example below into three tokens: the text before ->, the text between -> and :, and the text after :.
Ex: First part -> Second Part: Third part
For the first part, the regex .+?(?=->) works when I try it at https://regex101.com/r/TDBWg0/1.
But according to JFlex, .+?(?=->) has a syntax error:
Error in file "Simple.flex" (line 41):
Syntax error.
FIRST_PART=.+(?=(->))
^
Lexer generators like JFlex often have a different syntax and feature set than most other regex implementations, so helpers like regex101 aren't always that useful for them. Instead you should look at the JFlex manual to see which syntax JFlex supports.
There are three things of note there:
The syntax for lookahead is /regex not (?=regex)
There is no syntax for non-greedy quantifiers
< and > need to be quoted or escaped
So .+/"->" would be a valid regex, but when there are multiple ->s it will match up to the last ->, not the first. Presumably you tried to make the + non-greedy specifically so that it would only match up to the first, so this is no good.
Since there are no non-greedy modifiers in JFlex, we need a different approach. If we look at the available regex features again, we'll see that there's an operator ~, which works as follows:
~a (upto)
matches everything up to (and including) the first occurrence of a text matched by a. The expression ~a is equivalent to !([^]* a [^]*) a. A traditional C-style comment is matched by "/*" ~"*/".
So the regex you want is simply ~"->".
Another approach, which works with virtually every regex implementation, would be to write a regex that specifically matches everything that's not a ->, i.e. any character other than -, or a - that is not followed by a >. So that'd be:
([^-]|-[^\>])+
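As a quick sanity check of the negated pattern above, here is a small Python sketch. Python's re module is only a stand-in for exercising the idea (JFlex has its own quoting rules), and the names FIRST_PART and first_part are hypothetical.

```python
import re

# Same idea as ([^-]|-[^\>])+ : any non-hyphen character,
# or a hyphen followed by something other than '>'.
FIRST_PART = re.compile(r"(?:[^-]|-[^>])+")


def first_part(line):
    """Return the text matched from the start of line, or '' if nothing matches."""
    m = FIRST_PART.match(line)
    return m.group(0) if m else ""


print(repr(first_part("First part -> Second Part: Third part")))
print(repr(first_part("a--> b")))
```

One edge case is worth knowing about: because -[^>] also consumes the character after the hyphen, an input such as a--> b is matched past the arrow. If doubled hyphens can appear directly before ->, the pattern needs tightening.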

What regular expression variant is used in Visual Studio Code?

I know that I can use Ruby's regular expressions in a tmLanguage file, but that seems not to be the case in other configuration files, e.g. for extensions. Take for example the firstLine value in the language contribution. I get errors when I use character classes (e.g. \s or \p{L}). Hence I wonder what is actually allowed there. How would you match whitespace there?
Update:
After the comments I tried this:
"firstLine": "^(lexer|parser)?\\s*grammar\\w+;"
which is supposed to match a first line like lexer grammar G1; or just grammar G1;. Is there a way to test if that RE works, because I have no validation otherwise?
Update 2:
It's essential to use the correct grammar and it will magically work:
"firstLine": "^(lexer|parser)?\\s*grammar\\s*\\w+\\s*;"
.NET regular expressions use a syntax that is largely based on Perl 5, however it does add a few new features such as named capture groups and right to left matching, so the two should not be thought of as identical. Here is the full MSDN documentation for .NET regular expressions:
.NET Framework Regular Expressions
\s is a valid character class in .NET, but it is difficult to say exactly what the problem is without seeing the code you are trying. Andrew could be right, that you just did not escape the \.
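To check firstLine patterns like the two in the question outside the editor, one option is to decode the JSON string and try the resulting regex on sample lines. The sketch below uses Python's json and re modules purely as a test bench, on the assumption that \s and \w behave the same way for these ASCII inputs in the engine the editor uses; the constant and function names are hypothetical.

```python
import json
import re

# The two configuration fragments from the question, as they appear in JSON.
FIRST_ATTEMPT = r'{"firstLine": "^(lexer|parser)?\\s*grammar\\w+;"}'
FIXED = r'{"firstLine": "^(lexer|parser)?\\s*grammar\\s*\\w+\\s*;"}'


def first_line_matches(config_json, line):
    """Decode the JSON (turning \\\\s into \\s) and test the regex on line."""
    pattern = json.loads(config_json)["firstLine"]
    return re.search(pattern, line) is not None


for sample in ("lexer grammar G1;", "grammar G1;"):
    print(sample, first_line_matches(FIRST_ATTEMPT, sample),
          first_line_matches(FIXED, sample))
```

The first attempt fails on both samples because nothing in it allows whitespace between grammar and the name, which is consistent with the fix described in Update 2.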

Remove text between two characters (parenthesis) in a string

I'm working on a project and I want to remove text between two parentheses in a string.
Example:
std::string str = "I want to remove (this).";
How would I go about doing that?
I've searched Google and Stack Overflow and haven't found anything.
I'd use a regular expression for that. Check out the link I provided. As for the expression, use the following:
(\()(?:[^\)\\]*(?:\\.)?)*\)
That guy worked for me.
Conditionally replace regex matches in string
Do not get regular and common expressions confused. This is not like the more common expressions :-) or :-O or >:(. Although effective, those are mutually exclusive expressions that not many languages understand, but they are more commonly used.

Match repetition with regexp in ocamllex

I'm trying to write a lexer with ocamllex for a special native language (slightly modified for my purposes). Some words should be matched by their first character, which is doubled. But I can't find any way to express this repetition of the first character. I can't use the regex syntax
(['a'-'z'])\1['a'-'z']+
with that "\1". Ocamllex says "illegal escape sequence \1.", which I think is fair given the syntax of escape sequences, but it's certainly not what I wanted. Nor can I use the repetition syntax with curly braces in any way (though that wouldn't solve the problem anyway):
['a'-'z']{2}['a'-'z']+
I think there is a conflict with the OCaml code in the curly braces after the regexp.
Does anybody have an idea for that?
Thank you very much.
Ocamllex's regex syntax has neither counted repetition ({n}) nor backreferences. The available regex syntax is just what is listed in the reference manual:
http://caml.inria.fr/pub/docs/manual-ocaml-4.01/lexyacc.html#sec274
So I think you have to list all the possible repetitions manually, as below:
("aa"|"bb"|"cc"|"dd"|"ee"|"ff"| ..............)['a'-'z']+
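Typing all 26 alternatives by hand is error-prone, so the alternation can be generated and pasted into the .mll file. A minimal Python sketch follows; the helper name doubled_letter_pattern is hypothetical, and the output follows the shape of the pattern shown above.

```python
import string


def doubled_letter_pattern(letters=string.ascii_lowercase):
    """Build an ocamllex-style alternation of doubled letters plus a tail."""
    alternatives = "|".join(f'"{c}{c}"' for c in letters)
    return f"({alternatives})['a'-'z']+"


print(doubled_letter_pattern())
```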

Indent code inside brackets with regex

I would like to indent a code string with tabs. The simple rule is that I must append a tab after each line feed inside braces "{}".
My trouble is with nested braces: there the number of tabs needs to equal the nesting depth exactly.
Do you think it is possible to do this with a regex replace?
It is impossible to do this with a regex (at least with standard regexes, i.e. regular expressions describing regular languages), because the language you are describing is not regular.
With a regular language you cannot even tell whether a given string contains the same number of { as }.
We can show this by contradiction: if this language were regular, then using a homomorphism we could obtain the language L = {a^n b^n}, which is known not to be regular.
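What a regex cannot track, a single pass with a depth counter can. The following Python sketch implements the rule from the question under some assumptions of its own: the input is not already indented, braces inside strings or comments are not treated specially, and a line that starts with } is placed one level shallower. The function name indent_braces is hypothetical.

```python
def indent_braces(code, tab="\t"):
    """Insert one tab per open brace level after every line feed."""
    depth = 0
    out = []
    for i, ch in enumerate(code):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        out.append(ch)
        if ch == "\n":
            # A closing brace at the start of the next line belongs
            # to the enclosing level, so indent it one step less.
            next_is_close = code[i + 1:i + 2] == "}"
            level = depth - 1 if next_is_close else depth
            out.append(tab * max(level, 0))
    return "".join(out)


print(indent_braces("a {\nb {\nc\n}\n}\n"))
```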