Could regex be used in this PowerShell script? - regex

I have the following code, used to remove spaces and other characters from a string $m, and replace them with periods ('.'):
Function CleanupMessage([string]$m) {
$m = $m.Replace(' ', ".") # spaces to dot
$m = $m.Replace(",", ".") # commas to dot
$m = $m.Replace([char]10, ".") # linefeeds to dot
while ($m.Contains("..")) {
$m = $m.Replace("..",".") # multiple dots to dot
}
return $m
}
It works OK, but it seems like a lot of code and can be simplified. I've read that regex can work with patterns, but am not clear if that would work in this case. Any hints?

Use a regex character class:
Function CleanupMessage([string]$m) {
return $m -replace '[ ,.\n]+', '.'
}
EXPLANATION
--------------------------------------------------------------------------------
[ ,.\n]+ any character of: ' ', ',', '.', '\n' (newline)
(1 or more times (matching the most amount
possible))

Solution for this case:
cls
$str = "qwe asd,zxc`nufc..omg"
Function CleanupMessage([String]$m)
{
$m -replace "( |,|`n|\.\.)", '.'
}
CleanupMessage $str
# qwe.asd.zxc.ufc.omg
Universal solution. Just enum in $toReplace what do you want to replace:
cls
$str = "qwe asd,zxc`nufc..omg+kfc*fox"
Function CleanupMessage([String]$m)
{
$toReplace = " ", ",", "`n", "..", "+", "fox"
.{
$d = New-Guid
$regex = [Regex]::Escape($toReplace-join$d).replace($d,"|")
$m -replace $regex, '.'
}
}
CleanupMessage $str
# qwe.asd.zxc.ufc.omg.kfc*.

Related

RegEx in Powershell, combine replace calls

I've written my own CSS minifier for fun and profit (not so much profit), and it works great. I am now trying to streamline it, since I'm essentially filtering the file 10+ times. Not a huge deal with a small file, but the larger they get, the worse that performance hit will be.
Is there a more elegant way to filter my input file? I'm assuming regex will have a way, but I am no regex wizard...
$a = (gc($path + $file) -Raw)
$a = $a -replace "\s{2,100}(?<!\S)", ""
$a = $a -replace " {", "{"
$a = $a -replace "} ", "}"
$a = $a -replace " \(", "\("
$a = $a -replace "\) ", "\)"
$a = $a -replace " \[", "\["
$a = $a -replace "\] ", "\]"
$a = $a -replace ": ", ":"
$a = $a -replace "; ", ";"
$a = $a -replace ", ", ","
$a = $a -replace "\n", ""
$a = $a -replace "\t", ""
To save you a little headache, i'm basically using the first -replace to strip any successive witespace from 2-100 characters in length.
The remaining replace statements cover cleaning up single spaces in specific circumstances.
How can I combine this, so I'm not filtering the file 12 times?
negative lookbehind (?<!\S) is used in this scenario: (?<!prefix)thing to match a thing which does not have the prefix on the left. When you put it at the end of the regex, with nothing after it, I think it does nothing at all. You might have intended it to go on the left, or might have intended to to be a negative lookahead, I won't try to guess, I'll just remove it for this answer.
You're missing the use of character classes. abc looks for the text abc, but put them in square brackets and [abc] looks for any of the characters a, b, c.
Using that, you can combine the last two lines into one: [\n\t] which replace either a newline or a tab.
You can combine the two separate (replace with nothing) rules using regex logical OR | to make one match: \s{2,100}|[\n\t] - match the spaces or the newline or tab. (You could probably use OR twice instead of characters, fwiw).
Use regex capture groups which allow you to reference whatever the regex matched, without knowing in advance what that was.
e.g. "space bracket -> bracket" and "space colon -> colon" and "space comma -> comma" all follow the general pattern "space (thing) -> (thing)". And the same with the trailing spaces "(thing) space -> (thing)".
Combine capture groups with character classes to merge the rest of the lines all into one.
e.g.
$a -replace " (:)", '$1' # capture the colon, replacement is not ':'
# it is "whatever was in the capture group"
$a -replace " ([:,])", '$1' # capture the colon, or comma. Replacement
# is "whatever was in the capture group"
# space colon -> colon, space comma -> comma
# make the space optional with \s{0,1} and put it at the start and end
\s{0,1}([:,])\s{0,1} #now it will match "space (thing)" or "(thing) space"
# Add in the rest of the characters, with appropriate \ escapes
# gained from [regex]::Escape('those chars here')
# Your original:
$a = (gc D:\css\1.css -Raw)
$a = $a -replace "\s{2,100}(?<!\S)", ""
$a = $a -replace " {", "{"
$a = $a -replace "} ", "}"
$a = $a -replace " \(", "\("
$a = $a -replace "\) ", "\)"
$a = $a -replace " \[", "\["
$a = $a -replace "\] ", "\]"
$a = $a -replace ": ", ":"
$a = $a -replace "; ", ";"
$a = $a -replace ", ", ","
$a = $a -replace "\n", ""
$a = $a -replace "\t", ""
# My version:
$b = gc d:\css\1.css -Raw
$b = $b -replace "\s{2,100}|[\n\t]", ""
$b = $b -replace '\s{0,1}([])}{([:;,])\s{0,1}', '$1'
# Test that they both do the same thing on my random downloaded sample file:
$b -eq $a
# Yep.
Do that again with another | to combine the two into one:
$c = gc d:\css\1.css -Raw
$c = $c -replace "\s{2,100}|[\n\t]|\s{0,1}([])}{([:;,])\s{0,1}", '$1'
$c -eq $a # also same output as your original.
NB. that the space and tab and newline capture nothing, so '$1' is empty,
which removes them.
And you can spend lots of time building your own unreadable regex which probably won't be noticeably faster in any real scenario. :)
NB. '$1' in the replacement, the dollar is a .Net regex engine syntax, not a PowerShell variable. If you use double quotes, PowerShell will string interpolate from the variable $1 and likely replace it with nothing.
You may join the patterns that are similar into 1 bigger expression with capturing groups, and use a callback inside a Regex replace method where you may evaluate the match structure and use appropriate action.
Here is a solution for your scenario that you may extend:
$callback = { param($match)
if ($match.Groups[1].Success -eq $true) { "" }
else {
if ($match.Groups[2].Success -eq $true) { $match.Groups[2].Value }
else {
if ($match.Groups[3].Success -eq $true) { $match.Groups[3].Value }
else {
if ($match.Groups[4].Success -eq $true) { $match.Groups[4].Value }
}
}
}
}
$path = "d:\input\folder\"
$file = "input_file.txt"
$a = [IO.File]::ReadAllText($path + $file)
$rx = [regex]'(\s{2,100}(?<!\S)|[\n\t])|\s+([{([])|([])}])\s+|([:;,])\s+'
$rx.Replace($a, $callback) | Out-File "d:\result\file.txt"
Pattern details:
(\s{2,100}(?<!\S)|[\n\t]) - Group 1 capturing 2 to 100 whitespaces not preceded with a non-whitespace char (maybe this lookbehind is redundant) OR a newline or tab char
| - or
\s+([{([]) - just matching one or more whitespaces (\s+), and then capturing into Group 2 any single char from the [{([] character class: {, ( or [
|([])}])\s+ - or Group 3 capturing any single char from the [])}] character class: }, ) or ] and then just matching one or more whitespaces
|([:;,])\s+ - or Group 4 capturing any char from [:;,] char class (:, ; or ,) and one or more whitespaces.

Powershell regex replacement expressions

I've followed the excellent solution in this article:
PowerShell multiple string replacement efficiency
to try and normalize telephone numbers imported from Active Directory. Here is an example:
$telephoneNumbers = #(
'+61 2 90237534',
'04 2356 3713'
'(02) 4275 7954'
'61 (0) 3 9635 7899'
'+65 6535 1943'
)
# Build hashtable of search and replace values.
$replacements = #{
' ' = ''
'(0)' = ''
'+61' = '0'
'(02)' = '02'
'+65' = '001165'
'61 (0)' = '0'
}
# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = #($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
# Return replacement value for each matched value.
$matchedValue = $matchInfo.Groups[0].Value
$replacements[$matchedValue]
}
# Perform replace over every line in the file and append to log.
$telephoneNumbers |
foreach {$r.Replace($_,$matchEval)}
I'm having problems with the formatting of the match expressions in the $replacements hashtable. For example, I would like to match all +61 numbers and replace with 0, and match all other + numbers and replace with 0011.
I've tried the following regular expressions but they don't seem to match:
'^+61'
'^+[^61]'
What am I doing wrong? I've tried using \ as an escape character.
I've done some re-arrangement of this, I'm not sure if it works for your whole situation but it gives the right results for the example.
I think the key is not to try and create one big regex from the hashtable, but rather to loop over it and check the values in it against the telephone numbers.
The only other change I made was moving the ' ','' replacement from the hash into the code that prints the replacement phone number, as you want this to run in every scenario.
Code is below:
$telephoneNumbers = #(
'+61 2 90237534',
'04 2356 3713'
'(02) 4275 7954'
'61 (0) 3 9635 7899'
'+65 6535 1943'
)
$replacements = #{
'(0)' = ''
'+61' = '0'
'(02)' = '02'
'+65' = '001165'
}
foreach ($t in $telephoneNumbers) {
$m = $false
foreach($r in $replacements.getEnumerator()) {
if ( $t -match [regex]::Escape($r.key) ) {
$m = $true
$t -replace [regex]::Escape($r.key), $r.value -replace ' ', '' | write-output
}
}
if (!$m) { $t -replace ' ', '' | write-output }
}
Gives:
0290237534
0423563713
0242757954
61396357899
00116565351943

Perl parsing JavaScript file regex, to catch quotes only at the beginning and end of the returned string

I'm just starting to learn Perl. I need to parse JavaScript file. I came up with the following subroutine, to do it:
sub __settings {
my ($_s) = #_;
my $f = $config_directory . "/authentic-theme/settings.js";
if ( -r $f ) {
for (
split(
'\n',
$s = do {
local $/ = undef;
open my $fh, "<", $f;
<$fh>;
}
)
)
{
if ( index( $_, '//' ) == -1
&& ( my #m = $_ =~ /(?:$_s\s*=\s*(.*))/g ) )
{
my $m = join( '\n', #m );
$m =~ s/[\'\;]//g;
return $m;
}
}
}
}
I have the following regex, that removes ' and ; from the string:
s/[\'\;]//g;
It works alright but if there is a mentioned chars (' and ;) in string - then they are also removed. This is undesirable and that's where I stuck as it gets a bit more complicated for me and I'm not sure how to change the regex above correctly to only:
Remove only first ' in string
Remove only last ' in string
Remove ont last ; in string if exists
Any help, please?
You can use the following to match:
^'|';?$|;$
And replace with '' (empty string)
See DEMO
Remove only first ' in string
Remove only last ' in string
^[^']*\K'|'(?=[^']*$)
Try this .See demo.
https://regex101.com/r/oF9hR9/8
Remove ont last ; in string if exists
;(?=[^;]*$)
Try this.See demo.
https://regex101.com/r/oF9hR9/9
All three in one
^[^']*\K'|'(?=[^']*$)|;(?=[^;]*$)
See Here
You can use this code:
#!/usr/bin/perl
$str = "'string; 'inside' another;";
$str =~ s/^'|'?;?$//g;
print $str;
IDEONE demo
The main idea is to use anchors: ^ beginning of string, $ end of string and ;? matches the ";" symbol at the end only if it is present (? quantifier is making the pattern preceding it optional).EDIT: Also, ; will get removed even if there is no preceding '.
I suggest that your original code should look more like this. It is much more idiomatic Perl and I think more straightforward to follow
sub __settings {
my ($_s) = #_;
my $file = "$config_directory/authentic-theme/settings.js";
return unless -r $file;
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my #file = <$fh>;
chomp #file;
for ( #file ) {
next if m{//};
if ( my #matches = $_ =~ /(?:$_s\s*=\s*(.*))/g ) {
my $matches = join "\n", #matches;
$matches =~ tr/';//d;
return $matches;
}
}
}

Regex to capture words between a specific word

I'm trying to get a regex that matches: (It should not match any other string)
Word1 or Word2 or Word3 or Wordn
Capturing the words between before or after an "or"
1: Word1
2: Word2
3: Word3
n: Wordn
I've tried modifying a csv regex:
(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
to
(?:^|(?:or)((?:[^(?:or)]+)*|[^(?:or)]*)
But that does not give me what I want.
I'm sure I'm missing something, but I've been banging my head for hours.
How about:
my $string = " foo or bar or foobar ";
if ( $string =~ m|^\s*[^\s]+(\s+or\s+[^\s]+)+\s*$| ) {
my $tmp = "$string";
$tmp =~ s|^\s+||;
$tmp =~ s|\s+$||;
my #words = split( /\s+or\s+/, $tmp );
printf( "Found %d words:\n", scalar( #words ) );
foreach my $word ( #words ) {
print( "\t'$word'\n" );
}
} else {
print( "No match\n" );
}
The above will output:
Found 3 words:
'foo'
'bar'
'foobar'
Try splitting the string on ' or '.
You know, this isn't something for which I'd naturally reach for regex. I'd try a split first.
my #words = split / or /, $string;
This regex will match any string that has at least word1 or word2, and any number more or's after that. It must have no whitespace at the beginning or end of the string as well, but you can remove the ^ and $ if you want to search for a string of this form within a larger string
(?:^(\w+)(?=\s+or))|(?:\s+or\s+(\w+))+
RegexPal
The real solution is to split on ' or '. A regex solution is not so straight forward.
$sm =~ / or / and #between_or = $sm =~ /(?:^\s*|(?<= or ))(.+?)(?= or |\s*$)/sg;

How can I count the amount of spaces at the start of a string in Perl?

How can I count the amount of spaces at the start of a string in Perl?
I now have:
$temp = rtrim($line[0]);
$count = ($temp =~ tr/^ //);
But that gives me the count of all spaces.
$str =~ /^(\s*)/;
my $count = length( $1 );
If you just want actual spaces (instead of whitespace), then that would be:
$str =~ /^( *)/;
Edit: The reason why tr doesn't work is it's not a regular expression operator. What you're doing with $count = ( $temp =~ tr/^ // ); is replacing all instances of ^ and with itself (see comment below by cjm), then counting up how many replacements you've done. tr doesn't see ^ as "hey this is the beginning of the string pseudo-character" it sees it as "hey this is a ^".
You can get the offset of a match using #-. If you search for a non-whitespace character, this will be the number of whitespace characters at the start of the string:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
my $count = $s =~ /\S/ ? $-[0] : length $s;
print "'$s' has $count whitespace characters at its start\n";
}
Or, even better, use #+ to find the end of the whitespace:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
$s =~ /^\s*/;
print "$+[0] '$s'\n";
}
Here's a script that does this for every line of stdin. The relevant snippet of code is the first in the body of the loop.
#!/usr/bin/perl
while ($x = <>) {
$s = length(($x =~ m/^( +)/)[0]);
print $s, ":", $x, "\n";
}
tr/// is not a regex operator. However, you can use s///:
use strict; use warnings;
my $t = (my $s = " \t\n sdklsdjfkl");
my $n = 0;
++$n while $s =~ s{^\s}{};
print "$n \\s characters were removed from \$s\n";
$n = ( $t =~ s{^(\s*)}{} ) && length $1;
print "$n \\s characters were removed from \$t\n";
Since the regexp matcher returns the parenthesed matches when called in a list context, CanSpice's answer can be written in a single statement:
$count = length( ($line[0] =~ /^( *)/)[0] );
This prints amount of white space
echo " hello" |perl -lane 's/^(\s+)(.*)+$/length($1)/e; print'
3