RegEx in Powershell, combine replace calls - regex

I've written my own CSS minifier for fun and profit (not so much profit), and it works great. I am now trying to streamline it, since I'm essentially filtering the file 10+ times. Not a huge deal with a small file, but the larger they get, the worse that performance hit will be.
Is there a more elegant way to filter my input file? I'm assuming regex will have a way, but I am no regex wizard...
$a = (gc($path + $file) -Raw)
$a = $a -replace "\s{2,100}(?<!\S)", ""
$a = $a -replace " {", "{"
$a = $a -replace "} ", "}"
$a = $a -replace " \(", "\("
$a = $a -replace "\) ", "\)"
$a = $a -replace " \[", "\["
$a = $a -replace "\] ", "\]"
$a = $a -replace ": ", ":"
$a = $a -replace "; ", ";"
$a = $a -replace ", ", ","
$a = $a -replace "\n", ""
$a = $a -replace "\t", ""
To save you a little headache: I'm basically using the first -replace to strip any run of successive whitespace from 2 to 100 characters in length.
The remaining replace statements cover cleaning up single spaces in specific circumstances.
How can I combine this, so I'm not filtering the file 12 times?

A negative lookbehind such as (?<!\S) is normally used like this: (?<!prefix)thing, to match a thing that does not have the prefix on its left. When you put it at the end of the regex, with nothing after it, it does nothing useful: the character immediately before that position is whitespace that \s{2,100} has just matched, so the assertion always succeeds. You might have intended it to go on the left, or might have intended it to be a negative lookahead; I won't try to guess, I'll just remove it for this answer.
You're missing the use of character classes. abc looks for the text abc, but put them in square brackets and [abc] looks for any of the characters a, b, c.
Using that, you can combine the last two lines into one: [\n\t], which matches either a newline or a tab.
You can combine the two separate (replace with nothing) rules using regex logical OR | to make one match: \s{2,100}|[\n\t] - match the spaces or the newline or tab. (You could probably use OR twice instead of a character class, fwiw.)
Use regex capture groups which allow you to reference whatever the regex matched, without knowing in advance what that was.
e.g. "space bracket -> bracket" and "space colon -> colon" and "space comma -> comma" all follow the general pattern "space (thing) -> (thing)". And the same with the trailing spaces "(thing) space -> (thing)".
Combine capture groups with character classes to merge the rest of the lines all into one.
e.g.
$a -replace " (:)", '$1' # capture the colon, replacement is not ':'
# it is "whatever was in the capture group"
$a -replace " ([:,])", '$1' # capture the colon, or comma. Replacement
# is "whatever was in the capture group"
# space colon -> colon, space comma -> comma
# make the space optional with \s{0,1} and put it at the start and end
\s{0,1}([:,])\s{0,1} #now it will match "space (thing)" or "(thing) space"
# Add in the rest of the characters, with appropriate \ escapes
# gained from [regex]::Escape('those chars here')
# Your original:
$a = (gc D:\css\1.css -Raw)
$a = $a -replace "\s{2,100}(?<!\S)", ""
$a = $a -replace " {", "{"
$a = $a -replace "} ", "}"
$a = $a -replace " \(", "\("
$a = $a -replace "\) ", "\)"
$a = $a -replace " \[", "\["
$a = $a -replace "\] ", "\]"
$a = $a -replace ": ", ":"
$a = $a -replace "; ", ";"
$a = $a -replace ", ", ","
$a = $a -replace "\n", ""
$a = $a -replace "\t", ""
# My version:
$b = gc d:\css\1.css -Raw
$b = $b -replace "\s{2,100}|[\n\t]", ""
$b = $b -replace '\s{0,1}([])}{([:;,])\s{0,1}', '$1'
# Test that they both do the same thing on my random downloaded sample file:
$b -eq $a
# Yep.
Do that again with another | to combine the two into one:
$c = gc d:\css\1.css -Raw
$c = $c -replace "\s{2,100}|[\n\t]|\s{0,1}([])}{([:;,])\s{0,1}", '$1'
$c -eq $a # also same output as your original.
NB. that the space and tab and newline capture nothing, so '$1' is empty,
which removes them.
And you can spend lots of time building your own unreadable regex which probably won't be noticeably faster in any real scenario. :)
NB. In the replacement '$1', the dollar sign is .NET regex engine syntax, not a PowerShell variable. If you use double quotes, PowerShell will try to interpolate a variable named $1 and will likely replace it with nothing.
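The single-pass idea above (alternation plus one capture group) is not specific to .NET. As a minimal sketch, here is the same pattern shape in Python rather than PowerShell, purely so it is self-contained and easy to run; the function name minify and the sample input are made up for illustration, and the character class is written with explicit escapes instead of the unescaped form used above.

```python
import re

# One pass over the input. The alternatives are:
#   1. a run of 2-100 whitespace characters        -> deleted
#   2. a single newline or tab                      -> deleted
#   3. an optional space on either side of one of   -> replaced by the
#      the punctuation characters in group 1           captured character
MINIFY = re.compile(r"\s{2,100}|[\n\t]|\s?([\]\)\}\{\(\[:;,])\s?")

def minify(css):
    # \g<1> expands to the empty string when group 1 did not take part
    # in the match (Python 3.5+), so alternatives 1 and 2 just vanish.
    return MINIFY.sub(r"\g<1>", css)

print(minify("a {\n\tcolor: red;\n}\n"))  # prints: a{color:red;}
```

The order of the alternatives matters: the engine tries them left to right at each position, so a run of whitespace is consumed by the first alternative before the third one gets a chance to look at it.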

You may join the patterns that are similar into 1 bigger expression with capturing groups, and use a callback inside a Regex replace method where you may evaluate the match structure and use appropriate action.
Here is a solution for your scenario that you may extend:
$callback = { param($match)
    if ($match.Groups[1].Success -eq $true) { "" }
    else {
        if ($match.Groups[2].Success -eq $true) { $match.Groups[2].Value }
        else {
            if ($match.Groups[3].Success -eq $true) { $match.Groups[3].Value }
            else {
                if ($match.Groups[4].Success -eq $true) { $match.Groups[4].Value }
            }
        }
    }
}
$path = "d:\input\folder\"
$file = "input_file.txt"
$a = [IO.File]::ReadAllText($path + $file)
$rx = [regex]'(\s{2,100}(?<!\S)|[\n\t])|\s+([{([])|([])}])\s+|([:;,])\s+'
$rx.Replace($a, $callback) | Out-File "d:\result\file.txt"
Pattern details:
(\s{2,100}(?<!\S)|[\n\t]) - Group 1 capturing 2 to 100 whitespaces not preceded with a non-whitespace char (maybe this lookbehind is redundant) OR a newline or tab char
| - or
\s+([{([]) - just matching one or more whitespaces (\s+), and then capturing into Group 2 any single char from the [{([] character class: {, ( or [
|([])}])\s+ - or Group 3 capturing any single char from the [])}] character class: }, ) or ] and then just matching one or more whitespaces
|([:;,])\s+ - or Group 4 capturing any char from [:;,] char class (:, ; or ,) and one or more whitespaces.
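For comparison, the same callback technique can be sketched in Python, whose re.sub also accepts a function as the replacement. This is only an illustration under assumptions: the names RX, pick_replacement and minify are hypothetical, and the redundant lookbehind is left out of group 1.

```python
import re

# Group 1: whitespace to delete. Groups 2-4: the punctuation character
# to keep when the surrounding whitespace is dropped.
RX = re.compile(r"(\s{2,100}|[\n\t])|\s+([{(\[])|([\])}])\s+|([:;,])\s+")

def pick_replacement(m):
    # Exactly one top-level alternative matched, so at most one group is set.
    if m.group(1) is not None:
        return ""
    return m.group(2) or m.group(3) or m.group(4)

def minify(css):
    return RX.sub(pick_replacement, css)

print(minify("a {\n\tb: c;\n}\n"))  # prints: a{b:c;}
```

The callback form is more verbose than a '$1' style replacement, but it leaves room for per-group logic (for example, replacing some matches with a single space instead of nothing).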

Related

Could regex be used in this PowerShell script?

I have the following code, used to remove spaces and other characters from a string $m, and replace them with periods ('.'):
Function CleanupMessage([string]$m) {
$m = $m.Replace(' ', ".") # spaces to dot
$m = $m.Replace(",", ".") # commas to dot
$m = $m.Replace([char]10, ".") # linefeeds to dot
while ($m.Contains("..")) {
$m = $m.Replace("..",".") # multiple dots to dot
}
return $m
}
It works OK, but it seems like a lot of code and can be simplified. I've read that regex can work with patterns, but am not clear if that would work in this case. Any hints?
Use a regex character class:
Function CleanupMessage([string]$m) {
return $m -replace '[ ,.\n]+', '.'
}
EXPLANATION
--------------------------------------------------------------------------------
[ ,.\n]+ any character of: ' ', ',', '.', '\n' (newline)
(1 or more times (matching the most amount
possible))
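The character-class approach carries over to other regex engines unchanged. A minimal Python sketch of the same function (cleanup_message is just a transliteration of the name above):

```python
import re

def cleanup_message(m):
    # Any run of spaces, commas, dots or newlines collapses to one dot,
    # which also covers the "multiple dots to dot" loop in the question.
    return re.sub(r"[ ,.\n]+", ".", m)

print(cleanup_message("qwe asd,zxc\nufc..omg"))  # prints: qwe.asd.zxc.ufc.omg
```

Including the dot itself in the class is what makes the while loop unnecessary: mixed runs such as ", ." are replaced in one step.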
Solution for this case:
cls
$str = "qwe asd,zxc`nufc..omg"
Function CleanupMessage([String]$m)
{
$m -replace "( |,|`n|\.\.)", '.'
}
CleanupMessage $str
# qwe.asd.zxc.ufc.omg
Universal solution. Just list in $toReplace whatever you want to replace:
cls
$str = "qwe asd,zxc`nufc..omg+kfc*fox"
Function CleanupMessage([String]$m)
{
$toReplace = " ", ",", "`n", "..", "+", "fox"
.{
$d = New-Guid
$regex = [Regex]::Escape($toReplace-join$d).replace($d,"|")
$m -replace $regex, '.'
}
}
CleanupMessage $str
# qwe.asd.zxc.ufc.omg.kfc*.
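The GUID trick above exists only so that the whole joined string can be escaped in one call. A sketch of the same idea in Python, escaping each item separately before joining with |; the parameter name to_replace mirrors $toReplace and is otherwise an assumption:

```python
import re

def cleanup_message(message, to_replace):
    # Escape every literal on its own, then join with | so the result is
    # an alternation of literals with no accidental metacharacters.
    pattern = "|".join(re.escape(item) for item in to_replace)
    return re.sub(pattern, ".", message)

items = [" ", ",", "\n", "..", "+", "fox"]
print(cleanup_message("qwe asd,zxc\nufc..omg+kfc*fox", items))
# prints: qwe.asd.zxc.ufc.omg.kfc*.
```

Note that, as in the PowerShell version, the * survives because it is not in the list.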

Dynamic regular expression for nested brackets fails due to unknown bug

Recently I ran into a strange bug when using a dynamic regular expression in Perl to match nested brackets. The original string is " {...test{...}...} ", and I want to grab the brace pair that begins with test, i.e. "test{...}". There may be many pairs of braces before and after this group, and I don't know in advance how deeply they nest.
Following is my match script: nesting_parser.pl
#! /usr/bin/env perl
use Getopt::Long;
use Data::Dumper;
my %args = @ARGV;
if(exists$args{'-help'}) {printhelp();}
unless ($args{'-file'}) {printhelp();}
unless ($args{'-regex'}) {printhelp();}
my $OpenParents;
my $counts;
my $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;$counts++; print "\nLeft:".$OpenParents." ;"})
| \} (?(?{$OpenParents ne 0; $counts++}) (?{$OpenParents--;print "Right: ".$OpenParents." ;"})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
my $string = `cat $args{'-file'}`;
my $partten = $args{'-regex'} ;
print "####################################################\n";
print "Grep [$partten\{...\}] from $args{'-file'}\n";
print "####################################################\n";
while ($string =~ /($partten$NestedGuts)/xmgs){
print $1."}\n";
print $2."####\n";
}
print "Regex has seen $counts brackts\n";
sub printhelp{
print "Usage:\n";
print "\t./nesting_parser.pl -file [file] -regex '[regex expression]'\n";
print "\t[file] : file path\n";
print "\t[regex] : regex string\n";
exit;
}
Actually my regex is:
our $OpenParents;
our $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;})
| \} (?(?{$OpenParents ne 0}) (?{$OpenParents--})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
I have added brace counting in nesting_parser.pl.
I also wrote a string generator for debugging: gen_nesting.pl
#! /usr/bin/env perl
use strict;
my $buffer = "{{{test{";
unless ($ARGV[0]) {print "Please specify the nest pair number!\n"; exit}
for (1..$ARGV[0]){
$buffer.= "\n\{\{\{\{$_\}\}\}\}";
#$buffer.= "\n\{\{\{\{\{\{\{\{\{$_\}\}\}\}\}\}\}\}\}";
}
$buffer .= "\n\}}}}";
open TEXT, ">log_$ARGV[0]";
print TEXT $buffer;
close TEXT;
You can generate a test file by
./gen_nesting.pl 1000
It will create a log file named log_1000, which includes 1000 lines of brace pairs.
Now we test our match script:
./nesting_parser.pl -file log_1000 -regex "test" > debug_1000
debug_1000 looks like a perfect result: matched successfully! But when I generate a 4000-line test file and match it again, it seems to crash:
./gen_nesting.pl 4000
./nesting_parser.pl -file log_4000 -regex "test" > debug_4000
The end of debug_4000 shows
{{{{3277}
####
Regex has seen 26213 brackts
I don't know what's wrong with the regular expression. It mostly works well for paired brackets, but recently I found that it crashes when I try to match a text file of more than 600,000 lines.
I'm really confused by this problem and hope to solve it.
Thank you all!
First for matching nested brackets I normally use Regexp::Common.
Next, I'm guessing that your problem is that Perl's regular expression engine breaks after matching 32767 groups. You can verify this by turning on warnings and looking for a message like Complex regular subexpression recursion limit (32766) exceeded.
If so, you can rewrite your code using /g and \G and pos. The idea being that you match the brackets in a loop like this untested code:
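my $start = pos($string) // 0;   # where the previous /g match stopped
my $open_brackets = 0;
my $failed;
# Keep scanning until the first bracket has been seen and depth is back to 0.
while (0 < $open_brackets or $start == (pos($string) // 0)) {
    # \G anchors at the current pos(); skip non-braces, capture the next brace.
    if ($string =~ m/\G[^{}]*(\{|\})/g) {
        if ($1 eq '{') {
            $open_brackets++;
        }
        else {
            $open_brackets--;
        }
    }
    else {
        $failed = 1;
        last; # WE FAILED TO MATCH (Perl uses last, not break)
    }
}
if (not $failed and 0 == $open_brackets) {
    # The third argument of substr is a length, not an end offset.
    my $matched = substr($string, $start, pos($string) - $start);
}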
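The loop above replaces regex recursion with an explicit depth counter. As a language-neutral illustration, here is a minimal Python sketch of the same counting scan; the function name balanced_span is hypothetical, and it assumes the caller has already located the opening brace (for example, right after the text matched by the user's pattern).

```python
import re

BRACE = re.compile(r"[{}]")

def balanced_span(text, start):
    """Return the balanced {...} group that begins at text[start], or None."""
    if start >= len(text) or text[start] != "{":
        return None
    depth = 0
    # Visit every brace from start onwards, tracking nesting depth.
    for m in BRACE.finditer(text, start):
        depth += 1 if m.group() == "{" else -1
        if depth == 0:
            return text[start:m.end()]
    return None  # ran out of input before the group was closed

s = "{{{test{a{b}c}}}}"
print(balanced_span(s, s.index("test") + len("test")))  # prints: {a{b}c}
```

Because the nesting depth lives in an ordinary integer rather than in the regex engine's own state, the number of brace pairs is bounded only by the size of the input.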

perl regex matching, why is it not finding all matches, why is the order important?

I ran into a problem with Perl's regex matching. I distilled it down to a small example on the command line. Why is the order in which the matches are attempted important here?
1.
$ echo "XYG" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
2.
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; if ($_ =~ m/G/gi) { print "Matches G\n"; } else { print "No match on G\n"; } '
Matches X
Matches Y
No match on G
Example 1 matches all three letters as expected, but example 2 does not match the letter G. Why?
However if I create an intermediate variable, here named $aa:
$ echo "GXY" | perl -ne 'if ($_ =~ m/X/gi) { print "Matches X\n"; } ; if ($_ =~ m/Y/gi) { print "Matches Y\n"; } ; $aa = $_; if ($aa =~ m/G/gi) { print "Matches G\n"; } '
Matches X
Matches Y
Matches G
Then the match works again ?
My perl version is:
$ perl -e 'print "$]\n";'
5.022001
On a LM 18.2 machine
$ lsb_release -d
Description: Linux Mint 18.2 Sonya
Ty+BR
Max.
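Because if you match a regex in scalar context like that with the g flag set (for global matching), the match is iterative: each successful match records where it ended in the string (see pos), and the next m//g against the same string resumes from there. That's to allow you to do things like while ( m/somepattern/g ) { and have it trigger multiple times.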
That's because g means:
g - globally match the pattern repeatedly in the string
It wouldn't be particularly useful if it reset each time you tried it. But you can also use it slightly differently, in list context:
my @matches = $str =~ m/(some_capture)/g;
And that'll select them all into a list.
But with your code and regex debugging:
#!/usr/bin/env perl
use strict;
use warnings;
use re 'debug';
$_ = 'GXY';
if ( $_ =~ m/X/gi ) { print "Matches X\n"; }
if ( $_ =~ m/Y/gi ) { print "Matches Y\n"; }
if ( $_ =~ m/G/gi ) { print "Matches G\n"; }
else { print "No match on G\n"; }
You'll get (snipped for brevity):
Matching REx "X" against "GXY"
Matching REx "Y" against "Y"
Matching REx "G" against ""
The first match 'eats' "GX" to find "X", leaving "Y" for the next match, but nothing at all for the "G" match.
The simple workaround is to omit the g flag, because then you're saying explicitly 'match once' and you'll get:
Matches X
Matches Y
Matches G
Alternatively, you can use the global match with a character class:
$_ = 'GXY';
my @matches = m/([GYX])/g; #implicitly operates on $_
print "Match on $_\n" for @matches;
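Python's re module has no hidden per-string position, so the behaviour described above has to be emulated by hand, which makes it easier to see. The following is only a rough model of scalar-context m//g, not Perl's implementation; the function name scalar_g_matches is made up.

```python
import re

def scalar_g_matches(text, patterns):
    """Emulate successive scalar-context m//g matches on one string."""
    pos = 0            # plays the role of Perl's pos($_)
    results = []
    for pattern in patterns:
        m = re.compile(pattern, re.IGNORECASE).search(text, pos)
        if m:
            pos = m.end()   # the next search resumes after this match
        else:
            pos = 0         # a failed match resets the position
        results.append(m is not None)
    return results

print(scalar_g_matches("XYG", ["X", "Y", "G"]))  # prints: [True, True, True]
print(scalar_g_matches("GXY", ["X", "Y", "G"]))  # prints: [True, True, False]
```

In the second call the search for G starts after the Y, at the end of the string, which is the same reason the Perl one-liner reports no match on G.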

PowerShell -split on Pipe Character

Consider the ASCII text file test1.txt:
a,b,c
d,e,f
And the following Powershell Script test1.ps1:
$input -split "`n" | ForEach-Object {
$row = $_ -split ","
$row[0]
}
The output is, as expected:
a
d
However, if we change the separator to | everything fails as in test2.txt:
a|b|c
d|e|f
And the following Powershell Script test2.ps1:
$input -split "`n" | ForEach-Object {
$row = $_ -split "|"
$row[0]
}
The output is all but empty. Why does the -split fail?
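-split expects a regular expression, and in a regex an unescaped | is the alternation operator. On its own, "|" is an alternation of two empty patterns, so it matches the empty string rather than a literal pipe. You therefore need to escape the pipe, as in: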
$row = $_ -split "\|"
Or specify the SimpleMatch option to split on the literal string or character:
$row = $_ -split "|", 0, "SimpleMatch"
The 0 stands for MaxSubstrings: "The maximum number of substrings, by default all (0)."
Source: http://ss64.com/ps/split.html
Also: Get-Help about_Split
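The same distinction between a regex split and a literal split exists in most languages. A small Python sketch of the two fixes (escape the metacharacter, or use a split that never interprets its argument as a regex); the variable names are illustrative only:

```python
import re

line = "a|b|c"

# Regex split with the pipe escaped by hand.
by_regex = re.split(r"\|", line)

# Regex split with the separator escaped programmatically, comparable in
# spirit to calling [regex]::Escape() on the separator first.
by_escape = re.split(re.escape("|"), line)

# Plain string split: no regex involved, comparable to SimpleMatch.
by_literal = line.split("|")

print(by_regex[0], by_escape[0], by_literal[0])  # prints: a a a
```

When the separator comes from user input or configuration, the programmatic escape or the literal split is the safer choice, since a hand-written backslash only covers the characters you thought of.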

How can I count the amount of spaces at the start of a string in Perl?

How can I count the amount of spaces at the start of a string in Perl?
I now have:
$temp = rtrim($line[0]);
$count = ($temp =~ tr/^ //);
But that gives me the count of all spaces.
$str =~ /^(\s*)/;
my $count = length( $1 );
If you just want actual spaces (instead of whitespace), then that would be:
$str =~ /^( *)/;
Edit: The reason why tr doesn't work is that it's not a regular expression operator. What you're doing with $count = ( $temp =~ tr/^ // ); is replacing all instances of ^ and space with themselves (see comment below by cjm), then counting up how many replacements you've done. tr doesn't see ^ as "hey, this is the beginning-of-string pseudo-character"; it sees it as "hey, this is a ^".
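The anchored-match idea (match the leading run, then take its length) looks the same in any regex flavour. A minimal Python sketch, with hypothetical helper names, showing the difference between counting spaces only and counting any whitespace:

```python
import re

def leading_spaces(s):
    # re.match is anchored at the start; " *" always matches, possibly empty.
    return len(re.match(r" *", s).group())

def leading_whitespace(s):
    # Same idea, but \s also counts tabs and newlines.
    return len(re.match(r"\s*", s).group())

print(leading_spaces("   foo bar"))      # prints: 3
print(leading_whitespace(" \t\n x"))     # prints: 4
```

Because the pattern can match the empty string, there is no need for a separate "no match" branch when the string has no leading whitespace.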
You can get the offset of a match using @-. If you search for a non-whitespace character, this will be the number of whitespace characters at the start of the string:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
my $count = $s =~ /\S/ ? $-[0] : length $s;
print "'$s' has $count whitespace characters at its start\n";
}
Or, even better, use @+ to find the end of the whitespace:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
$s =~ /^\s*/;
print "$+[0] '$s'\n";
}
Here's a script that does this for every line of stdin. The relevant snippet of code is the first in the body of the loop.
#!/usr/bin/perl
while ($x = <>) {
$s = length(($x =~ m/^( +)/)[0]);
print $s, ":", $x, "\n";
}
tr/// is not a regex operator. However, you can use s///:
use strict; use warnings;
my $t = (my $s = " \t\n sdklsdjfkl");
my $n = 0;
++$n while $s =~ s{^\s}{};
print "$n \\s characters were removed from \$s\n";
$n = ( $t =~ s{^(\s*)}{} ) && length $1;
print "$n \\s characters were removed from \$t\n";
Since the regexp matcher returns the parenthesed matches when called in a list context, CanSpice's answer can be written in a single statement:
$count = length( ($line[0] =~ /^( *)/)[0] );
This prints the amount of leading whitespace:
echo "   hello" |perl -lane 's/^(\s+)(.*)+$/length($1)/e; print'
3