Match Different Word Order Regex - regex

I have really been struggling trying to match a relatively simple set of possible word orders in a single Regex line.
Basically, I want to match these (among other grammatically similar) possibilities:
"set the var on"
"set the var off"
"set var on"
"set var off"
"set off the var"
"set on the var"
"set on var"
"set off var"
The only groups I need are "var" (which can be any single word) and the value, which will always be either on or off. That's the basic idea.
With that in mind, there are two possible grammar structures:
(on/off) (perhaps a word) (a word)
(a word) (on/off)
I have been able to independently match these possibilities with the following regex:
/((on |off )([a-z]{1,})? ([a-z]{2,}))/i
/([a-z]{2,}) (on|off)/i
So, I figured I could do this:
/(((on |off )([a-z]{1,})? ([a-z]{2,})))|(([a-z]{2,}) (on|off))/i
Which is just (phrase 1)|(phrase 2), but phrase two will always match against "set off" thinking that "set" is the name. I also tried:
/((?!set)) (((on |off )([a-z]{1,})? ([a-z]{2,})))|(([a-z]{2,}) (on|off))/i
With no success.
EDIT 1: Also, I neglected to mention that these phrases can be found anywhere in the file; they are not on independent lines.
E.g.: "this is the way to set the var on" is as likely as "set the var on"
Questions:
What is the best way that I can do this together without having to
separately match?
Is there a way to force a matching order for regex OR statements?

'the' may optionally appear before 'var':
((the)? var)
'set' always begins the expression:
^set
'on' and 'off' are mutually exclusive but one is required:
(on|off)
'var' and 'on'/'off' appear one after the other in no particular order. All together now:
^set ((the)? var (on|off)|(on|off) (the)? var)$
Note: I'm a .NET developer. Regexes are pretty standard, and the above should work, but there may be a more efficient way to write this in perl.
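The same two-branch idea can be sketched in JavaScript for the case where "var" is any single word and the phrase may appear anywhere in the text. This is only an illustration under assumptions: the helper name parseSetCommand is hypothetical, words are restricted to ASCII letters, and a negative lookahead keeps "on", "off" and "the" from being taken as the variable name.

```javascript
// Hypothetical helper (not from the thread): finds the first "set ..." phrase.
// Branch 1: set (on|off) [the] <word>     Branch 2: set [the] <word> (on|off)
function parseSetCommand(text) {
  const re = new RegExp(
    String.raw`\bset\s+(?:` +
      String.raw`(on|off)\s+(?:the\s+)?(?!(?:on|off|the)\b)([a-z]+)` + // value first
      "|" +
      String.raw`(?:the\s+)?(?!(?:on|off|the)\b)([a-z]+)\s+(on|off)` + // variable first
    String.raw`)\b`,
    "i"
  );
  const m = re.exec(text);
  if (m === null) return null;
  // Only one branch participates, so pick whichever groups are defined.
  return {
    variable: m[2] ?? m[3],
    value: (m[1] ?? m[4]).toLowerCase(),
  };
}

console.log(parseSetCommand("this is the way to set the var on"));
console.log(parseSetCommand("set off the switch"));
console.log(parseSetCommand("set on off"));
```

Because the lookahead rejects "off" as a variable name, the variable-first branch can no longer read "set off" as a variable called "set" followed by a value, which is the misparse described in the question.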

Whenever you try to match complex data, you should probably try to create a grammar. Perl regexes allow you to specify a recursive grammar via (?(DEFINE)...).
use strict; use warnings; use feature 'say';
my $grammar = qr(
set \s+ (?:the \s+)? (?<variable>(?&VAR)) \s+ (?:to \s+)? (?<value>(?&VAL))
| set \s+ (?<value>(?&VAL)) \s+ (?:the \s+)? (?<variable>(?&VAR))
(?(DEFINE)
(?<VAL> on | off) # edit only here to add new values
(?<VAR> (?!the|(?&VAL)) \w+)
)
)x; # /x -- whitespace is irrelevant
while(<>){
if (/$grammar/) { say "> val: $+{value} var: $+{variable}" }
else { say "> no match" }
}
Syntax to note: (?&rule) calls a named rule. (?<name>pattern) is a named capture, which allows access via the %+ hash; it is also used to declare rules in the (DEFINE) block.
Example session:
set the switch to off!
> val: off var: switch
I would like to set something on fire...
> val: on var: something
set on the set!
> val: on var: set
set on the set off something
> val: on var: set
set on off
> no match
Do note that I made the grammar fairly unambiguous by asserting that a variable does not match a value as well. However, the above examples do show some interesting cases that may not have been parsed as you would expect.
For a more powerful way to write grammars inside regexes, look at Regexp::Grammars.


Error while compiling regex function, why am I getting this issue?

My RAKU Code:
sub comments {
    if ($DEBUG) { say "<filtering comments>\n"; }
    my @filteredtitles = ();
    # This loops through each track
    for @tracks -> $title {
        ##########################
        # LAB 1 TASK 2 #
        ##########################
        ## Add regex substitutions to remove superflous comments and all that follows them
        ## Assign to $_ with smartmatcher (~~)
        ##########################
        $_ = $title;
        if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
            # Repeat for the other symbols
            ########################## End Task 2
            # Add the edited $title to the new array of titles
            @filteredtitles.push: $_;
        }
    }
    # Updates @tracks
    return @filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '@_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with @raiph's answer, here's what I have:
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that will match one character, as long as it is an open parenthesis (represented by the \() or a caret (^). When they go inside the angle brackets/square brackets combo, it means that is an Enumerated character class.
Then, the S introduces the non-destructive substitution, i.e., a quoting construct that makes regex-based substitutions over the topic variable $_ without modifying it; it just returns the value with the requested modifications. In the code above, S:g brings the adverb :g or :global (see the global adverb in the adverbs section of the documentation) into play, meaning (in the case of the substitution) "please make as many of these substitutions as possible". The final / marks the end of the substitution text, and as it is adjacent to the second /, that means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the: .map method, documented here, will be applied to any Iterable (List, Array and all their alikes) and will return a sequence based on each element of the Iterable, altered by a Code passed to it. So, something like:
@x.map({ S:g / foo /bar/ })
essentially means "please return a Sequence of every item in @x, modified by substituting any appearance of the substring foo with bar" (nothing will be altered in @x). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my @tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable @tracks (that has my scope)
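For readers more at home in JavaScript, a rough analogue of that one-liner is sketched below. It is a comparison only, not Raku: String.prototype.replace with a global regex returns a new string and leaves the original untouched (much like S:g), and Array.prototype.map plays the role of .map. The variable names are illustrative.

```javascript
// Placeholder list of titles, mirroring <Foo Ba(r B^az> in the Raku example.
const titles = ["Foo", "Ba(r", "B^az"];

// [(^] is a character class matching an open parenthesis or a caret;
// the g flag removes every occurrence, and replace() does not mutate the input.
const tracks = titles.map((title) => title.replace(/[(^]/g, ""));

console.log(tracks); // [ 'Foo', 'Bar', 'Baz' ]
console.log(titles); // original list is unchanged
```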
Here's what I ended up with:
my @tracks = <Foo Ba(r B^az>;
sub comments {
    my @filteredtitles;
    for @tracks -> $_ is copy {
        s:g / <[\(^]> //;
        @filteredtitles.push: $_;
    }
    return @filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { @_ }
But there is no way the code you've shared could generate that error because it requires that there is an @_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
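The difference between the two patterns can be reproduced in JavaScript, assuming that a lookahead (?=[...]) is a fair stand-in for Raku's zero-width <?[...]>. This is an analogy rather than a translation of the Raku engine's behaviour:

```javascript
const title = "Ba(r";

// Analogue of .*<?[...]> : consume everything up to, but not including,
// the special character. Substituting the match deletes the ordinary text.
const withLookahead = title.replace(/.*(?=[(^])/g, "");

// Analogue of <[...]> : consume the special character itself.
const withClass = title.replace(/[(^]/g, "");

console.log(withLookahead); // "(r"  (the parenthesis survives)
console.log(withClass);     // "Bar" (only the parenthesis is removed)
```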
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)

Perl switch/case Fails on Literal Regex String Containing Non-Capturing Group '?'

I have text files containing lines like:
2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
I am matching them with /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/ But I also have some files with lines in a completely different format, which I match with a different regex. When I open a file I determine which format and assign $pat = '<regex-string>'; in a switch/case block:
$pat = '/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/'
But the ? character that introduces the non-capturing group I use to match repeats after the date and before the first currency amount causes the Perl interpreter to fail to compile the script, reporting on abort:
syntax error at ./report-dates-amounts line 28, near "}continue "
If I delete the ? character, or replace ? with \? escaped character, or first assign $q = '?' then replace ? with $q inside a " string assignment (ie. $pat = "/^\s*(\S+)\s+($q:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/"; ) the script compiles and runs. If I assign the regex string outside the switch/case block that also works OK. Perl v5.26.1 .
My code also doesn't have any }continue in it, which as reported in the compilation failure is probably some kind of transformation of the switch/case code by Switch.pm into something native the compiler chokes on. Is this some kind of bug in Switch.pm? It fails even when I use given/when in exactly the same way.
#!/usr/local/bin/perl
use Switch;
# Edited for demo
switch($format)
{
# Format A eg:
# 2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
# 3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
# 3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
# 4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
#
case /^(?:april|snow)$/i
{ # This is where the ? character breaks compilation:
$pat = '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$';
# WORKS:
# $pat = '^\s*(\S+)\s+(' .$q. ':[0-9|\)| ]+)+\s+\D' .$q. '(\S+)\s+\$';
}
# Format B
case /^(?:umberto|petro)$/i
{
$pat = '^(\S+)\s+.*Think 1\s+(\S+)\s+';
}
}
Don't use Switch. As mentioned by @choroba in the comments, Switch uses a source filter, which leads to mysterious and hard-to-debug errors, as you discovered.
The module's documentation itself says:
In general, use given/when instead. It were introduced in perl 5.10.0. Perl 5.10.0 was released in 2007.
However, given/when is not necessarily a good option as it is experimental and likely to change in the future (it seems that this feature was almost removed from Perl v5.28; so you definitely don't want to start using it now if you can avoid it). A good alternative is to use for:
for ($format) {
if (/^(?:april|snow)$/i) {
...
}
elsif (/^(?:umberto|petro)$/i) {
...
}
}
It might look weird at first, but once you get used to it, it's actually reasonable in my opinion. Or, of course, you can use none of these options and just do:
sub pattern_from_format {
my $format = shift;
if ($format =~ /^(?:april|snow)$/i) {
return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/;
}
elsif ($format =~ /^(?:umberto|petro)$/i) {
return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/;
}
# Some error handling here maybe
}
If, for some reason, you still want to use Switch: use m/.../ instead of /.../.
I have no idea why this bug is happening, however, the documentation says:
Also, the presence of regexes specified with raw ?...? delimiters may cause mysterious errors. The workaround is to use m?...? instead.
Which I misread at first, and therefore tried to use m/../ instead of /../, which fixed the issue.
Another option instead of an if/elsif chain would be to loop over a hash which maps your regular expressions to the values which should be assigned to $pat:
#!/usr/local/bin/perl
my %switch = (
'^(?:april|snow)$' => '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$',
'^(?:umberto|petro)$' => '^(\S+)\s+.*Think 1\s+(\S+)\s+',
);
for my $re (keys %switch) {
if ($format =~ /$re/i) {
$pat = $switch{$re};
last;
}
}
For a more general case (i.e., if you're doing more than just assigning a string to a scalar) you could use the same general technique, but use coderefs as the values of your hash, thus allowing it to execute an arbitrary sub based on the match.
This approach can cover a pretty wide range of the functionality usually associated with switch/case constructs, but note that, because the conditions are pulled from the keys of a hash, they'll be evaluated in a random order. If you have data which could match more than one condition, you'll need to take extra precautions to handle that, such as having a parallel array with the conditions in the proper order or using Tie::IxHash instead of a regular hash.
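The "parallel array with the conditions in the proper order" idea can be sketched as a single ordered list of rules. The sketch below is in JavaScript purely for illustration; the names formatRules and patternFromFormat are hypothetical, and the two patterns are copied from the question. In Perl the same shape would be an array of pairs rather than a hash.

```javascript
// Ordered list: the first rule whose `format` regex matches wins,
// so the evaluation order is deterministic, unlike iterating hash keys.
const formatRules = [
  {
    format: /^(?:april|snow)$/i,
    pattern: /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/,
  },
  {
    format: /^(?:umberto|petro)$/i,
    pattern: /^(\S+)\s+.*Think 1\s+(\S+)\s+/,
  },
];

// Returns the line-matching regex for a format name, or null if none applies.
function patternFromFormat(format) {
  const rule = formatRules.find((r) => r.format.test(format));
  return rule ? rule.pattern : null;
}

const pat = patternFromFormat("April");
const fields = pat.exec("2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75");
console.log(fields[1], fields[2]); // 2/17/2018 2,000.0
```

Replacing the pattern property with a function would give the coderef variant described above.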

Regex to select text outside of underscores

I am looking for a regex to select the text which falls outside of underscore characters.
Sample text:
PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant
Basically I need to be able to select the first keyword, which is always before the first underscore, and the last keyword, which is always after the last underscore. As an additional complexity, there can also be texts which have no underscore at all; these need to be selected completely as well.
The best I got yet was this expression:
^((?! *\_[^)]*\_ *).)*
which is only yielding me the first part, not the second and it has no support for the non-underscore yet at all.
This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic.
Thanks!
Use JavaScript string function split(). Check below example.
var t = "PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant";
var arr = t.split('_');
console.log(arr);
//Access the required parts like this
console.log(arr[0] + ' ' + arr[arr.length - 1]);
Perhaps something like this:
/(^[^_]+)|([^_]+$)/g
That is, match either:
^[^_]+ the beginning of the string followed by non-underscores, or
[^_]+$ non-underscores followed by the end of the string.
var regex = /(^[^_]+)|([^_]+$)/g
console.log("A_b_c_D".match(regex)) // ["A", "D"]
console.log("A_b_D".match(regex)) // ["A", "D"]
console.log("A_D".match(regex)) // ["A", "D"]
console.log("AD".match(regex)) // ["AD"]
I'm not sure if you should use a regex here. I think splitting the string at underscore, and using the first and last element of the resulting array might be faster, and less complicated.
Trivial with .replace:
str.replace(/_.*_/, '')
// "PartIWantPartIwant"
With matching, you'd need to be selecting and concatenating groups:
parts = str.match(/^([^_]*).*?([^_]*)$/)
parts[1] + parts[2]
// "PartIWantPartIwant"
EDIT
This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic.
This is not possible: a regular expression cannot match a discontinuous span.

In Vim, is there a "matching braces/parenthesis/etc" equivalent in substitute/search symbols?

I would like to replace for instance every occurrence of "foo{...}" with anything except newlines inside the bracket (there may be spaces, other brackets opened AND closed, etc) NOT followed by "bar".
For instance, the "foo{{ }}" in "foo{{ }}, bar" would match but not "foo{hello{}}bar".
I've tried /foo{.*}\(bar\)\@! and /foo{.\{-}}\(bar\)\@! but the first one would match "foo{}bar{}" and the second would match "foo{{}}bar" (only the "foo{{}" part).
this regex:
foo{.*}\([}]*bar\)\@!
matches:
foo{{ }}
foo{{ }}, bar
but not:
foo{hello{}}bar
It is impossible to correctly match an arbitrary level of nested
parentheses using regular expressions. However, it is possible to
construct a regex to match supporting a limited amount of nesting (I
think this answer did not attempt to do so). – Ben
This does ...
for up to one level of inner braces:
/foo{[^{}]*\({[^{}]*}[^{}]*\)*}\(bar\)\@!
for up to two levels of inner braces:
/foo{[^{}]*\({[^{}]*\({[^{}]*}[^{}]*\)*}[^{}]*\)*}\(bar\)\@!
for up to three levels of inner braces:
/foo{[^{}]*\({[^{}]*\({[^{}]*\({[^{}]*}[^{}]*\)*}[^{}]*\)*}[^{}]*\)*}\(bar\)\@!
...
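The pattern grows by the same wrapping step at each level, so it can be generated rather than typed. The sketch below shows that construction in JavaScript regex syntax, not Vim's; the function names are hypothetical and (?!bar) stands in for Vim's \(bar\)\@!.

```javascript
// Builds a pattern body that accepts brace-free text plus balanced
// {...} groups nested up to `depth` levels deep.
function nestedBraceBody(depth) {
  let body = "[^{}]*"; // depth 0: no inner braces at all
  for (let level = 0; level < depth; level++) {
    // Wrap the previous level: text, then any number of {previous}text runs.
    body = `[^{}]*(?:\\{${body}\\}[^{}]*)*`;
  }
  return body;
}

// foo{...} with up to `depth` levels of inner braces, not followed by "bar".
function fooWithoutBar(depth) {
  return new RegExp(`foo\\{${nestedBraceBody(depth)}\\}(?!bar)`);
}

console.log(fooWithoutBar(1).exec("foo{{ }}, bar")[0]); // foo{{ }}
console.log(fooWithoutBar(1).test("foo{hello{}}bar"));  // false
console.log(fooWithoutBar(1).test("foo{{{}}}"));        // false: two inner levels
console.log(fooWithoutBar(2).test("foo{{{}}}"));        // true
```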
Depending on what replacement you want to perform exactly, you might be able to do that with macros.
For example: Given this text
line 1 -- -- -- -- array[a][b[1]]
line 2 -- array[c][d]
line 3 -- -- -- -- -- -- -- array[e[0]][f] + array[g[0]][h[0]]
replace array[A][B] with get(A, B).
To do that:
Position the cursor at the beginning of the text
/array<cr>
qq to begin recording a macro
Do something to change the data independent of the content inside (use % to go to matching bracket, and some register/mark/plugin to delete around the bracket). For example cwget(<esc>ldi[vhpa, <esc>ldi[vhpa)<esc>n -- but macros are usually unreadable.
n to go to next match, q to stop recording
@q repeatedly (@@ can be used from the second time)
This is probably not very convenient because it's easy to make a mistake (press I, <home>, A for example) and you have to redo the macro from the beginning, but it works.
Alternatively, you can do something similar to eregex.vim plugin to extend vim's regex format to support this (so you don't have to retype the huge regex every time).
Proof of concept:
"does not handle different magic levels
"does not handle '\/' or different characters for substitution ('s#a#b#')
"does not handle brackets inside strings
" usage: `:M/pattern, use \zm for matching block/replacement/flags`
command -range -nargs=* M :call SubstituteWithMatching(<q-args>, <line1>, <line2>)
":M/ inspired from eregex.vim
function SubstituteWithMatching(command, line1, line2)
let EscapeRegex={pattern->escape(pattern, '[]\')}
let openbracket ='([{'
let closebracket=')]}'
let nonbracketR='[^'.EscapeRegex(openbracket.closebracket).']'
let nonbracketsR=nonbracketR.'*'
let LiftLevel={pattern->
\nonbracketsR
\.'\%('
\.'['.EscapeRegex(openbracket).']'
\.pattern
\.'['.EscapeRegex(closebracket).']'
\.nonbracketsR
\.'\)*'
\}
let matchingR=LiftLevel(LiftLevel(LiftLevel(nonbracketsR)))
if v:false " optional test suite
echo "return 0:"
echo match('abc', '^'.matchingR.'$')
echo match('abc(ab)de', '^'.matchingR.'$')
echo match('abc(ab)d(e)f', '^'.matchingR.'$')
echo match('abc(a[x]b)d(e)f', '^'.matchingR.'$')
echo match('abc(a]b', '^'.matchingR.'$')
"current flaw (not a problem if there's only one type of bracket, or if
"the code is well-formed)
echo "return -1:"
echo match('abc(a(b', '^'.matchingR.'$')
echo match('abc)a(b', '^'.matchingR.'$')
endif
let [pattern, replacement, flags]=split(a:command, "/")
let pattern=substitute(pattern, '\\zm', EscapeRegex(matchingR), 'g')
execute a:line1.','.a:line2.'s/'.pattern.'/'.replacement.'/'.flags
endfunction
After this, :'<,'>M/array\[\(\zm\)\]\[\(\zm\)\]/get(\1, \2)/g can be used to do the same task above (after selecting the text in visual mode)

Regular expression to match CSV delimiters

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:
1,"abcd",2,"de,fg",3,"hijk"
I want to match all of the commas except for the one between the 'e' and 'f'. Alternatively, matching just that one is acceptable, if that is the easier or more sensible solution. I have the sense that I need to use a negative lookahead assertion to handle this, but I'm finding it a bit too difficult to figure out.
See my post that solves this problem for more detail.
^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$ Will match the whole line, then you can use match.Groups[1].Captures to get your data out (without the quotes). Also, I let "My name is ""in quotes""" be a valid string.
CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.
What language are you using?
As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data).
I suggest the tool CSVFIX as likely to do what you need.
To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):
"""",,"",a,"a,b"
Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:
"",,"",a",b c",
The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!
This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:
use strict;
use warnings;
my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );
foreach my $string (@list)
{
print "Pattern: <<$string>>\n";
while ($string =~ m/ (?: " ( (?:""|[^"])* ) " | ( [^,"] [^,]* ) | ( .? ) )
(?: $ | , ) /gx)
{
print "Found QF: <<$1>>\n" if defined $1;
print "Found PF: <<$2>>\n" if defined $2;
print "Found EF: <<$3>>\n" if defined $3;
}
}
Note that as written, you have to identify which of the three captures was actually used. With two-stage processing, you could just deal with one capture and then strip out enclosing double quotes and nested doubled-up double quotes. This regex assumes that if the field does not start with a double quote, then a double quote has no special meaning within the field. Have fun ringing the changes!
Output:
Pattern: <<"""",,"",a,"a,b">>
Found QF: <<"">>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a>>
Found QF: <<a,b>>
Found EF: <<>>
Pattern: <<"",,"",a",b c",>>
Found QF: <<>>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a">>
Found PF: <<b c">>
Found EF: <<>>
We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct.
Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.
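The two-stage idea (capture the field, then collapse doubled quotes) can be sketched in JavaScript as follows. This is a lenient illustration, not a full CSV parser: the function name splitCsvLine is hypothetical, embedded newlines are not handled, and a field that does not parse as a quoted field is kept as raw text.

```javascript
// Splits one CSV line into fields using a sticky regex that is re-applied
// at the position where the previous field ended.
function splitCsvLine(line) {
  // Group 1: contents of a quoted field; group 2: an unquoted field;
  // group 3: the delimiter that ended the field ("," or end of line).
  const field = /(?:"((?:""|[^"])*)"|([^,]*))(,|$)/y;
  const fields = [];
  let pos = 0;
  while (true) {
    field.lastIndex = pos;
    const m = field.exec(line);
    // Defensive guard: the unquoted alternative can match any input,
    // so this is not expected to trigger.
    if (m === null) throw new Error(`no field found at index ${pos}`);
    // Second stage: collapse each "" inside a quoted field to a single ".
    fields.push(m[1] !== undefined ? m[1].replace(/""/g, '"') : m[2]);
    if (m[3] === "") return fields; // the field ended at end of line
    pos = field.lastIndex;
  }
}

console.log(splitCsvLine('"""",,"",a,"a,b"'));
console.log(splitCsvLine('1,"abcd",2,"de,fg",3,"hijk"'));
```

Stopping as soon as a field ends at end of line avoids the spurious empty field at the end of the first pattern, while a trailing comma still produces one final empty field.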
Without thinking too hard, I would do something like [0-9]+|"[^"]*" to match everything except the comma delimiters. Would that do the trick?
Without context it's impossible to give a more specific solution.
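For the narrow case in the question (well-formed lines, balanced quotes, no embedded newlines), the delimiter commas themselves can be matched with a lookahead that requires an even number of double quotes between the comma and the end of the line. The sketch below uses JavaScript; the same lookahead syntax is available in PCRE. It is an assumption-laden illustration, not a general CSV solution.

```javascript
// A comma is a delimiter if the rest of the line contains an even number
// of double quotes, i.e. the comma is not inside a quoted field.
const delimiter = /,(?=(?:[^"]*"[^"]*")*[^"]*$)/g;

const line = '1,"abcd",2,"de,fg",3,"hijk"';

// Indexes of the delimiter commas; the comma between "e" and "f" is skipped.
const positions = [...line.matchAll(delimiter)].map((m) => m.index);
console.log(positions); // [ 1, 8, 10, 18, 20 ]

// Splitting on the same pattern keeps the quotes around quoted fields.
console.log(line.split(delimiter));
```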
Andy's right: correctly parsing CSV is a lot harder than you probably realise, and has all kinds of ugly edge cases. I suspect that it's mathematically impossible to correctly parse CSV with regexes, particularly those understood by sed.
Instead of sed, use a Perl script that uses the Text::CSV module from CPAN (or the equivalent in your preferred scripting language). Something like this should do it:
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new ( { binary => 1, eol => $/ } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $rows = $csv->getline_all(STDIN);
for my $row (@$rows) {
say join("\t", @$row);
}
That assumes that you don't have any tab characters embedded in your data, of course - perhaps it would be better to do the subsequent stages in a Real Scripting Language as well, so you could take advantage of proper lists?
I know this is old, but this RegEx works for me:
/(\"[^\"]+\")|[^,]+/g
It could potentially be used with any language. I tested it in JavaScript, so the g is just the global modifier. It works even with messed-up lines (extra quotes), but empty fields are not handled.
Just sharing, maybe this will help someone.