Regular Expressions (Normal OR Nested Brackets)

So I'm completely new to the overwhelming world of Regex. Basically, I'm using the Gedit API to create a new custom language specification (derived from C#) for syntax-highlighting (for DM from Byond). In escaped characters in DM, you have to use [variable] as an escaping syntax, which is simple enough. However, it could also be nested, such as [array/list[index]] for instance. (It could be nested infinitely.) I've looked through the other questions, and when they ask about nested brackets they only mean exclusively nested, whereas in this case it could be either/or.
Several attempts I've tried against the test string "Test [Test[Test] Test]Test[Test] Test":
\[.*\] matches "[Test[Test] Test]Test[Test]", i.e. everything from the first [ to the last ].
\[.*?\] matches "[Test[Test]" and "[Test]".
\[(?:.*)\] matches "[Test[Test] Test]Test[Test]", the same as the first attempt.
\[(?:(?!\[|\]).)*\] matches only the inner "[Test]" and the final "[Test]". This is derived from https://stackoverflow.com/a/9580978/2303154 but, as mentioned above, that only matches if there are no brackets inside.
Obviously I've no real idea what I'm doing here in more complex matching, but at least I understand more of the basic operations from other sources.

From @Chaos7Theory:
Upon reading GtkSourceView's Specification Reference, I've figured out that it uses PCRE specifically. I then used that as a lead.
Digging into it and through trial-and-error, I got it to work with:
\[(([^\[\]]*|(?R))*)\]
I hope this helps someone else in the future.
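For engines without recursion support, the same idea can be sketched by hand. Python's standard re module, for instance, has no (?R), so the following is a swapped-in technique (a depth counter), not the PCRE pattern itself. The function name is made up for illustration, and the result only agrees with the recursive pattern when the brackets are balanced.

```python
def find_bracket_groups(text):
    """Return every outermost [...] group in text, nested brackets included."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "[":
            if depth == 0:
                start = i  # remember where the outermost group opens
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:i + 1])
    return groups


sample = "Test [Test[Test] Test]Test[Test] Test"
print(find_bracket_groups(sample))  # ['[Test[Test] Test]', '[Test]']
```

Unlike the regex, this scanner reports nothing for a group that is never closed.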

Related

OpenModelica SimulationOptions 'variableFilter' not working with '^' exceptions

To reduce the size of my simulation output files, I want to give variable name exceptions, instead of a list of many specific variables, to the simulationsOptions/outputFilter (cf. OpenModelica Users Guide / Output) of my model. I found the regexp operator "^" to fulfill my needs, but it didn't work as expected. So I think that something is wrong with the interpretation of connected character strings when negated.
Example:
When I have any derivatives der(...) in my model and use variableFilter=der.* the output file will contain all the filtered derivatives. Since there are no other variables beginning with the character d, the same happens with variableFilter=d.*. For testing I also tried variableFilter=rde.* to confirm that every variable is filtered.
When I now try to exclude them with variableFilter=^der.*, =^rde.* or =^d.*, I get exactly the same result as without using ^. So the operator seems to be ignored in this notation.
When I instead use variableFilter=[^der].*, =[^rde].* or even =[^d].*, all the derivative variables are filtered from the output, but there is no difference between those three expressions. To me it seems that every character is interpreted on its own and not as a connected string.
Did I understand and use the regexp usage right or could this be a code bug?
Side/follow-up question: Where can I officially report this for software revision?
OpenModelica v.1.19.2 (64-bit)
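In common regex syntax the two notations above mean different things: outside brackets ^ is a start-of-string anchor, while as the first character inside brackets it negates a set of single characters. A minimal sketch of the difference, using Python's re module purely for illustration. The variable names are made up, full-match filtering is an assumption based on the behaviour described above, and OpenModelica's own regex engine may differ (whether it accepts the lookahead in the last line is not confirmed here).

```python
import re

# Hypothetical variable names, for illustration only.
names = ["der(x)", "der(y)", "rate", "x", "e1"]


def keep(pattern, candidates):
    # Assumption: a variable is kept only if the whole name matches.
    return [n for n in candidates if re.fullmatch(pattern, n)]


print(keep(r"der.*", names))      # ['der(x)', 'der(y)']
print(keep(r"^der.*", names))     # ['der(x)', 'der(y)']  (^ is only an anchor)
print(keep(r"[^der].*", names))   # ['x']  (first character is not d, e or r)
print(keep(r"[^d].*", names))     # ['rate', 'x', 'e1']
print(keep(r"(?!der).*", names))  # ['rate', 'x', 'e1']  (negative lookahead)
```

When no other variable starts with d, e or r, the three bracket expressions give identical results, which is consistent with what the question reports.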

Scanning a language with non-delimited strings with nested tokens

I want to create a lexer/parser for a language that has non-delimited strings.
Which part of the language is a string is defined by the command preceding it.
For example it has statements that look like this:
pause 5
alert Hello world[CRLF] this contains 'pause' once (1)
Alert in this instance can end with any string, including keywords and numbers.
Further complicating things, the text can contain tags like [CRLF] that I want to separate too.
Ideally I'd want this to be broken up into:
[PAUSE][INT 5]
[ALERT][STR "Hello world"][CRLF][STR " this contains 'pause' once (1)"]
I'm currently using flex but from what I've gathered this kind of thing isn't possible with flex.
How can I achieve what I want here?

(Since one of your tags is "regex", I'll suggest a non-flex approach.)
From the example, it seems like you could just:
match each line against ^(\w+) (.+) to obtain command and arguments-text, and then
get individual arguments by splitting the arguments-text on (\[\w+\]) (assuming your regex library's split function can return both the splitter-strings and the split-strings).
It's possible your actual situation is more complex and something like flex makes more sense, but I'm not really seeing it so far.
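The two steps above can be sketched in Python, where re.split returns the separators as well when the pattern contains a capturing group. The token names and the STRING_COMMANDS set are hypothetical; the real language presumably has more commands and rules than this sketch assumes.

```python
import re

LINE = re.compile(r"^(\w+) (.+)$")
TAG = re.compile(r"(\[\w+\])")  # capturing group, so re.split keeps the tags

# Hypothetical: commands whose argument is always free text.
STRING_COMMANDS = {"alert"}


def tokenize(line):
    """Split one statement into (kind, value) tokens."""
    match = LINE.match(line)
    if match is None:
        return []
    command, rest = match.groups()
    tokens = [("CMD", command.upper())]
    for piece in TAG.split(rest):
        if not piece:
            continue  # re.split yields "" next to a tag at an edge or between tags
        if TAG.fullmatch(piece):
            tokens.append(("TAG", piece[1:-1]))
        elif command.lower() not in STRING_COMMANDS and piece.isdigit():
            tokens.append(("INT", int(piece)))
        else:
            tokens.append(("STR", piece))
    return tokens


print(tokenize("pause 5"))
print(tokenize("alert Hello world[CRLF] this contains 'pause' once (1)"))
```

Because the command is read first, keywords and numbers inside an alert string are never re-interpreted as tokens.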

When and why did the output of qr() change?

The output of perl's qr has changed, apparently sometime between versions 5.10.1 and 5.14.2, and the change is not documented--at least not fully.
To demonstrate the change, execute the following one-liner on each version:
perl -e 'print qr(foo)is."\n"'
Output from perl 5.10.1-17squeeze6 (Debian squeeze):
(?-xism:foo)
Output from perl 5.14.2-21+deb7u1 (Debian wheezy):
(?^:foo)
The perl documentation (perldoc perlop) says:
$rex = qr/my.STRING/is;
print $rex; # prints (?si-xm:my.STRING)
s/$rex/foo/;
which appears to no longer be true:
$ perl -e 'print qr/my.STRING/is."\n"'
(?^si:my.STRING)
I would like to know when this change occurred (which version of Perl, or supporting library or whatever).
Some background, in case it's relevant:
This change has caused a bunch of unit tests to fail. I need to decide if I should simply update the unit tests to reflect the new format, or make the tests dynamic enough to support both formats, etc. To make an informed decision, I would like to understand why the change took place. Knowing when and where it took place seems like the best place to start in that investigation.

It's documented in perl5140delta:
Regular Expressions
(?^...) construct signifies default modifiers
[...] Stringification of regular expressions now uses this notation. [...]
This change is likely to break code that compares stringified regular expressions with fixed strings containing ?-xism.
The function regexp_pattern can be used to parse the modifiers for normalisation purposes.

Part of the reason this was added was that regular expressions were getting quite a few new modifiers.
Your example would actually produce something like this if that change hadn't happened:
(?d-xismpaul:foo)
That also doesn't really express the modifiers in place:
d/u/l can only be added to a regex, not subtracted like i. They are also mutually exclusive.
a/aa: there are actually two levels for this modifier.
While work was underway on adding these modifiers, it was determined that this would break quite a few tests in CPAN modules.
Seeing as the tests were going to break anyway, it was agreed that there should be a way of specifying "just use the defaults" ((?^:…)).
That way, the tests wouldn't have to be updated every time a new modifier was added.

To receive the stringified form of a regexp you can use Regexp::Parser and its qr method. Using this module you can not only test the representation of a regexp, but also walk a tree.

Regex/Textmate Confusion

I'm trying to create a Textmate snippet, but have run into some difficulties. Basically, I want to type in a Name and split it into its parts.
Example,
Bill Gates: (Bill), (bill), (Gates), (gates), (Bill Gates), (Bill gates), (bill Gates), (bill gates)
EDIT:
I could certainly produce these results quite simply if I were using a programming language. For example, I could split the words and then call the uppercase or lowercase functions to produce this output.
But in my situation I am using Textmate and its regular expression capabilities to create a tab snippet. I want to type some trigger key, e.g. doit, press tab and then type in a username. Then the output above will be created. This won't save me that much time, but I feel like I come across this sort of thing in Textmate quite frequently and want to figure it out.
I have been using this as a reference, but still don't know how to use regexps to be selective with the words and to upper- and lowercase the values (\u \U \l \L):
http://manual.macromates.com/en/snippets

You can use Ruby for textmate snippets. That should make it simpler.

Why is this regular expression faster?

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
    if (rTest.IsMatch(formattedString))
        return rTest.Replace(formattedString, string.Empty);
    else
        return formattedString;
}
I'm new to regular expressions and it was suggested that I use this:
static Regex rTest = new Regex(@"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
    new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster and reduced the impact on my text rendering. Can someone explain to a regexp newbie why? :)

Do you really want to run the regexp twice? Without having checked (bad me), I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
    return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
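The point about the redundant check can be illustrated in Python (a sketch only, not the .NET code; \x1b stands in for \e, which Python's re module does not support, and the names are made up). A replace on a string with no match simply returns the input, so no separate match test is needed. The second pattern from the question is also fully optional, so it matches the empty string and a guard on it can never be false.

```python
import re

# First pattern from the question, with \x1b standing in for \e.
ANSI_SGR = re.compile(r"\x1b\[[\d;]+m")

# Second pattern from the question: every part is optional.
OPTIONAL = re.compile(r"(\x1b(\[([\d;]*[mz]?))?)?")


def strip_formatting(text):
    # No separate "is there a match?" test: sub() leaves text alone if nothing matches.
    return ANSI_SGR.sub("", text)


plain = "no escape codes here"
colored = "\x1b[1;32mbright green\x1b[0m default"

print(strip_formatting(plain) == plain)      # True
print(strip_formatting(colored))             # bright green default
print(OPTIONAL.search(plain).group(0) == "") # True: it matches the empty string
```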

The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? makes the quantifier lazy. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.

I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]
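A rough Python adaptation of the same idea, offered as a sketch rather than a port of the pattern above. Python's re module keeps only the last repetition of a repeated group, so here the parameters are captured as one string and split afterwards. \x1b stands in for \e, and the function name is made up.

```python
import re

# ESC [ params letter, then the text up to the next sequence or end of input.
SEQUENCE = re.compile(r"\x1b\[([\d;]*)([A-Za-z])(.*?)(?=\x1b\[|\Z)", re.DOTALL)


def parse_ansi(text):
    """Return (parameters, final letter, following text) for each sequence."""
    parsed = []
    for match in SEQUENCE.finditer(text):
        params = [int(p) for p in match.group(1).split(";") if p]
        parsed.append((params, match.group(2), match.group(3)))
    return parsed


sample = "\x1b[1;32mThis is bright green.\x1b[0m This is the default color."
for item in parse_ansi(sample):
    print(item)
# ([1, 32], 'm', 'This is bright green.')
# ([0], 'm', ' This is the default color.')
```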

Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)
