How to replace consecutive and identical characters in Perl? - regex

I have a string like
XXXXYYYYZZZYYZZZYYYY which needs to be converted to
XXXXAAAYZZZAYZZZAAAY
$s =~ s/Y{2}+/AY/g;
This has two problems: {2}+ turns YYYY into AYAY, and AY is not the same length as YYYY (I am expecting AAAY).
How can I do this in Perl?

Use a "look-ahead":
$s =~ s/Y(?=Y+)/A/g;
(?=Y+) means "followed by one or more Y characters", so any Y character that is followed by another Y character will be replaced with an A.
More info from perlretut
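For comparison, the same look-ahead idea carries over to other engines with lookaround support. A minimal sketch in Python, assuming the input string from the question (the variable names are only illustrative):

```python
import re

s = "XXXXYYYYZZZYYZZZYYYY"

# Replace every Y that is immediately followed by another Y.
# The look-ahead (?=Y) checks the next character without consuming it,
# so the last Y of each run is left untouched.
result = re.sub(r"Y(?=Y)", "A", s)

print(result)  # XXXXAAAYZZZAYZZZAAAY
```

Because the look-ahead consumes nothing, each Y is examined on its own, which is why the replacement keeps the original length.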

There's always more than one way to do it. My suggestion is to grab all the Ys except the last one, and then use that match to create a string of As of the same length. The e modifier tells Perl to execute the code on the replacement side instead of using it as a literal string, and the r modifier tells s/// to return the result of the substitution instead of modifying the input text in place (useful for these one-liner tests, among other places).
$ perl -E 'say shift =~ s/(Y+)(?=Y)/"A"x length$1/gre' XXXXYYYYZZZYYZZZYYYY
XXXXAAAYZZZAYZZZAAAY
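The "compute the replacement from the match" approach can be sketched in Python as well, where a function passed to re.sub plays the role of Perl's /e modifier. The helper name mask_run is hypothetical:

```python
import re

def mask_run(match):
    # match.group(1) holds every Y of the run except the last one;
    # build a string of As with the same length.
    return "A" * len(match.group(1))

# (Y+) is greedy, then backtracks by one so that (?=Y) can still see a Y.
result = re.sub(r"(Y+)(?=Y)", mask_run, "XXXXYYYYZZZYYZZZYYYY")

print(result)  # XXXXAAAYZZZAYZZZAAAY
```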

$s =~ s/Y{2}+/AY/g
The pattern Y{2}+ is obscure and rarely used. The + after {2} is a possessive quantifier, a feature related to atomic grouping that only a few advanced regex engines (Perl among them) support.
You might have meant (Y{2})+, which is (YY)+, or Y{2,}, which is YY+.
In Perl this is straightforward, because it supports lookaround:
perl -e '$s = "XXXXYYYYZZZYYZZZYYYY"; $s =~ s/Y(?=Y)/A/g; print $s'
A less capable regex engine such as sed can still do it, albeit in a more cumbersome way (note that this version would also turn an isolated Y into A):
echo XXXXYYYYZZZYYZZZYYYY |sed -E 's/YY+/&\n/g;s/Y/A/g;s/A\n/Y/g'

Complex regex - works in Powershell, not in Bash

The below code is a small portion of my code for Solarwinds to parse the output of a Netbackup command. This is fine for our Windows boxes but some of our boxes are RHEL.
I'm trying to convert the below code into something useable on RHEL 4.X but I'm running into a wall with parsing the regex. Obviously the below code has some of the characters escaped for use with Powershell, I have unescaped those characters for use with Shell.
I'm not great with Shell yet, but I will post a portion of my Shell code below the Powershell code.
$output = ./bpdbjobs
$Results = @()
$ColumnName = @()
foreach ($match in $OUTPUT) {
$matches = $null
$match -match "(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+`-\w+`-\w+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+`_\w+)|(\w+`_\w+`_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?\s+(?<FATPipe>\b[^\d\W]+\b)?"
$Results+=$matches
}
The below is a small portion of Shell code I've written (which is clearly very wrong, learning as I go here). I'm just using this to test the Regex and see if it functions in Shell - (Spoiler alert) it does not.
#!/bin/bash
#
backups=bpdbjobs
results=()
for results in $backups; do
[[ $results =~ /(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+\w+\-\w\-+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?/ ]]
done
$results
Below are the errors I get.
./netbackupsolarwinds.sh: line 9: syntax error in conditional expression: unexpected token `('
./netbackupsolarwinds.sh: line 9: syntax error near `/(?'
./netbackupsolarwinds.sh: line 9: ` [[ $results =~ /(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+\w+\-\w\-+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?/ ]]'
From man bash:
An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).
Meaning that the expression is parsed as a POSIX extended regular expression, which AFAIK does not support either named capturing groups ((?<name>...)) or character escapes (\d, \w, \s, ...).
If you want to use [[ $var =~ expr ]] you need to rewrite the regular expression. Otherwise use grep (which supports PCRE):
grep -P '(?<jobID>\d+)?\s+...' <<<$results
Updated answer, after comments exchange.
The quickest way to perform your migration is to use grep's --perl-regexp (-P) option for Perl-compatible regular expressions, as suggested in another answer.
If you still want to perform this operation with pure Bash, you need to rewrite the regular expression accordingly, following the documentation.
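To give an idea of what such a rewrite involves, here is a deliberately naive sketch in Python that turns named groups into plain groups and swaps \d, \w and \s for POSIX bracket expressions. The function name to_posix_ere is hypothetical. It leaves everything else untouched, so constructs such as \b or escapes inside a bracket expression like [^\d\W] still need manual attention:

```python
import re

# Perl-style shorthand -> POSIX bracket expression.
# Only valid when the shorthand appears outside of [...].
POSIX_CLASSES = {
    r"\d": "[0-9]",
    r"\w": "[[:alnum:]_]",
    r"\s": "[[:space:]]",
}

def to_posix_ere(pattern: str) -> str:
    # (?<name>...) becomes (...): POSIX ERE only has numbered groups.
    pattern = re.sub(r"\(\?<[A-Za-z_]\w*>", "(", pattern)
    # Swap the three shorthand escapes for bracket expressions.
    return re.sub(r"\\[dws]", lambda m: POSIX_CLASSES[m.group(0)], pattern)

converted = to_posix_ere(r"(?<jobID>\d+)\s+(?<Status>\w+)")
print(converted)  # ([0-9]+)[[:space:]]+([[:alnum:]_]+)
```

In Bash the converted groups would then be read by position from the BASH_REMATCH array instead of by name.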
Thanks all for the answers. I swapped to grep -P to no avail; it turns out the named capture groups were the problem for grep -P.
I was also unable to figure out a way to use grep to output the capture group matches to individual variables.
This led me to swap over to using Perl, as follows, with alterations to my regex.
bpdbjobs | perl -lne 'print "$1" if /(\d+)?\s+((\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+((Done)|(Active)|(\w+\w+\-\w\-+))?\s+(\d+)?\s+((\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+((\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+(\d+)?/g'
With $<num> referring to the capture group number. I can now list, display and (the important part) count the number of matches within an individual group, corresponding to the data found in each column.

Using the command line and regex to determine words that start sentences

I have the text:
This is a test. This is only a test! If there were an emergency, then Information would be provided for you.
I want to be able to determine which words start sentences. What I have now is:
$ cat <FILE> | perl -pe 's/[\s.?!]/\n/g;'
This just replaces whitespace and sentence punctuation with newlines, giving me:
This
is
a
test
This
is
only
a
test
If
there
were
an
emergency,
then
Information
would
be
provided
for
you
From here I could somehow extract the words that have either nothing above them (start of file) or a blank line above them, but I am unsure of exactly how to do this.
If you have a Perl of at least version 5.22.1 (or 5.22.0 and this case is not affected by the bug described here), then you can use the sentence boundaries in your regular expression.
use feature 'say';
foreach my $sentence (m/\b{sb}(\w+)/g) {
say $sentence;
}
Or, as a one-liner:
perl -nE 'say for /\b{sb}(\w+)/g'
If called with your example text, the output is:
This
This
If
It uses \b{sb}, which is the sentence boundary. You can read a tutorial about it on brian d foy's blog. The \b{} construct is called a Unicode boundary and is described in perlrebackslash.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
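local $/;    # slurp mode: read all of DATA at once
# capture a word at the start of the text or after sentence-ending punctuation
my @words = <DATA> =~ m/(?:^\s*|[.!?]+\s+)(\w+)/g;
print Dumper \@words;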
__DATA__
This is a test. This is only a test! If there were an emergency, then Information would be provided for you.
So as a command line:
perl -ne 'print join "\n", m/(?:^\s*|[.!?]+\s+)(\w+)/g;' somefile
You can use this GNU grep command to extract the first word after each period, ! or ?:
grep -oP '(?:^|[.?!])\s*\K[A-Z][a-z]+' file
This
This
If
Though I must caution you may get false results for cases like Mr. Smith.
Regex breakdown:
(?:^|[.?!]) - match start of line or DOT or ! or ?
\s* - match 0 or more whitespace characters
\K - reset the match to forget the data matched so far
[A-Z][a-z]+ - match a word starting with an upper case letter
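The same pattern can be tried outside grep. A rough Python sketch follows; Python's re module has no \K, so a capture group is used instead, and the sample text is the one from the question:

```python
import re

text = ("This is a test. This is only a test! If there were an emergency, "
        "then Information would be provided for you.")

# Start of text or sentence punctuation, optional whitespace,
# then a capitalized word captured in group 1.
starters = re.findall(r"(?:^|[.?!])\s*([A-Z][a-z]+)", text)

print(starters)  # ['This', 'This', 'If']
```

"Information" is skipped because it follows a space rather than sentence punctuation. As with the grep version, abbreviations such as "Mr. Smith" would still produce false positives.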

Swapping letters with regexp

How can I swap the letter o with the letter e and e with o?
I just tried this but I don't think this is a good way of doing this. Is there a better way?
my $str = 'Absolute force';
$str =~ s/e/___eee___/g;
$str =~ s/o/e/g;
$str =~ s/___eee___/o/g;
Output: Abseluto ferco
Use the transliteration operator:
$str =~ y/oe/eo/;
E.g.
$ echo "Absolute force" | perl -pe 'y/oe/eo/'
Abseluto ferco
As has already been said, the way to do this is the transliteration operator
tr/SEARCHLIST/REPLACEMENTLIST/cdsr
y/SEARCHLIST/REPLACEMENTLIST/cdsr
Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list. It returns the number of characters replaced or deleted. If no string is specified via the =~ or !~ operator, the $_ string is transliterated.
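For readers coming from other languages, character-for-character transliteration exists elsewhere too. A small Python sketch of the same o/e swap, using str.maketrans and str.translate (the variable names are illustrative):

```python
# Build a table mapping 'o' -> 'e' and 'e' -> 'o', like Perl's y/oe/eo/.
swap_table = str.maketrans("oe", "eo")

swapped = "Absolute force".translate(swap_table)

print(swapped)  # Abseluto ferco
```

Each character is looked up once in the table, so there is no risk of an already swapped letter being swapped back, which is the problem the placeholder trick in the question works around.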
However, I want to commend you on your creative use of regular expressions. Your solution works, although the placeholder string _ee_ would've been sufficient.
tr is only going to help you with single-character replacements though, so I'd like to quickly show you how to use regular expressions for a more complicated mass replacement. Basically, you just use the /e modifier to execute code in the RHS. The following will also do the replacement you were aiming for:
my $str = 'Absolute force';
$str =~ s/([eo])/$1 eq 'e' ? 'o' : 'e'/eg;
print $str;
Outputs:
Abseluto ferco
Note how the LHS (left hand side) matches both o and e, and then the RHS (right hand side) does a test to see which one matched and returns the opposite for replacement.
Now, it's common to have a list of words that you want to replace, so it's convenient to just build a hash of your from/to values and then dynamically build the regular expression. The following does that:
my $str = 'Hello, foo. How about baz? Never forget bar.';
my %words = (
foo => 'bar',
bar => 'baz',
baz => 'foo',
);
my $wordlist_re = '(?:' . join('|', map quotemeta, keys %words) . ')';
$str =~ s/\b($wordlist_re)\b/$words{$1}/eg;
Outputs:
Hello, bar. How about foo? Never forget baz.
The above could've worked for your e and o case as well, but would've been overkill. Note how I use quotemeta to escape the keys in case they contain a regular expression special character. I also intentionally used a non-capturing group around them in $wordlist_re so that the variable could be dropped into any regex and behave as desired. I then put the capturing group inside the s/// because it's important to be able to see what's being captured in a regex without having to backtrack to the value of an interpolated variable.
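The hash-driven replacement translates almost line for line to other languages. A sketch in Python, assuming the same word map as above, with re.escape standing in for quotemeta and a function in place of the /e replacement:

```python
import re

text = "Hello, foo. How about baz? Never forget bar."
words = {"foo": "bar", "bar": "baz", "baz": "foo"}

# Non-capturing alternation of the escaped keys, mirroring $wordlist_re.
wordlist_re = "(?:" + "|".join(re.escape(k) for k in words) + ")"

# The capturing group lives in the final pattern, as in the Perl version.
replaced = re.sub(r"\b(" + wordlist_re + r")\b",
                  lambda m: words[m.group(1)],
                  text)

print(replaced)  # Hello, bar. How about foo? Never forget baz.
```

All replacements happen in a single pass, so a word that has just been replaced is never matched again.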
The tr/// operator is best. However, if you wanted to use the s/// operator (to handle more than just single letter substitutions), you could write
$ echo 'Absolute force' | perl -pe 's/(e)|o/$1 ? "o" : "e"/eg'
Abseluto ferco
The capturing parentheses avoid the redundant $1 eq 'e' test in @Miller's answer.
From man sed:
y/source/dest/
Transliterate the characters in the pattern space which appear in source to the corresponding character in dest.
The tr command can do this too:
$ echo "Absolute force" | tr 'oe' 'eo'
Abseluto ferco

Powershell - Replacing a string with a variable ending with a dollar sign

I'm a bit lost with this one. For whatever reason the replace function in powershell doesn't play well with variables ending with a $ sign.
Command:
$var='A#$A#$'
$line=('$var='+"'"+"'")
$line -replace '^.+$',('$line='+"'"+$var+"'")
Expected output:
$line='A#$A#$'
Actual output:
$line='A#$A#
It looks like you're getting hit with a regex substitution that you don't want. The regex special variable $' represents everything after your match. Since your regex matches the entire string, $' is effectively empty. During the replace operation, the .NET regex engine sees $' in your replacement string (the trailing $ of $var followed by the closing quote) and substitutes in that empty string.
One way to avoid this is to replace all instances of $ in your $var string with $$:
$line -replace '^.+$',('$line='+"'"+($var.Replace('$','$$'))+"'")
You can see more information about regex substitution in .NET here:
Substitutions
I was able to find a band-aid of sorts by replacing $ with a special character and then reverting it back after the change. Preferably you would choose a character that doesn't have a key on your keyboard. For me I chose "¤".
$var='A#$A#$'
$var=$var -replace '\$','¤'
$line=("`$var=''")
$line -replace '^.+$',("`$line='$var'") -replace '¤','$'
I don't really understand the purpose of your posted lines, it seems to me that it would just make more sense to do $line='$line='''+$var+"'", BUT if you insist on your way, just do two replace calls, like this:
$line -replace '^.+$',('$line=''LOL''') -replace 'LOL',$var

What does s-/-- and s-/\Z-- in perl mean?

I am a beginner in perl and I have a query regarding pattern matching.
I came across a line in perl where it was written
$variable =~ s-/\Z--;
Further on in the code, another variable was modified in a similar way:
$variable1 =~ s-/--;
Can you please tell me what these 2 lines do?
I want to know what s-/\Z-- and s-/-- mean.
$variable =~ s-/\Z--;
- is used as a delimiter here. However, best practice suggests that you either use / or {} as delimiters.
It could be re-written as:
$variable =~ s{/\Z}{}; # remove a / at the end of a string
Consider:
$variable1 =~ s-/--;
Again, it could be re-written as:
$variable1 =~ s{/}{}; # remove the first /
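To see the two substitutions side by side, here is a rough equivalent in Python, using a hypothetical path string. One caveat: Python's \Z matches only at the very end of the string, whereas Perl's \Z also matches just before a trailing newline:

```python
import re

path = "/usr/local/lib/"  # hypothetical example value

# Like s{/\Z}{}: remove a slash only if it is the last character.
no_trailing = re.sub(r"/\Z", "", path)

# Like s{/}{} without /g: remove only the first slash.
no_first = re.sub(r"/", "", path, count=1)

print(no_trailing)  # /usr/local/lib
print(no_first)     # usr/local/lib/
```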
The s/// operator in Perl is a substitution operation, which performs a search-and-replace on a string using a special kind of pattern called a regular expression. You can read more about regular expressions and Perl's pattern matching in the man pages that come with Perl:
man perlretut
man perlre
If you don't have these on your system, try searching Google for the same.
Applying a substitution to a variable is done with the =~ operator. So the following replaces the first instance of 'foo' in the variable $var with 'bar' (add the /g modifier to replace all instances).
$var =~ s/foo/bar/;
All the Perl operators are documented on the 'perlop' man page.
Even though the most common separator character is a slash (hence s///), you can also use any other punctuation character as a separator. So in this case, the author has decided to use the dash (-) as the separator.
Here's the same line of code above using dash as a separator:
$var =~ s-foo-bar-;
In your case, the dash doesn't seem to add any clarity to the code, so it might be best to update it to use the conventional slashes instead.
The s/// search and replace function in Perl can be used with different delimiters, which is what is done in this case. They have replaced / with the minus sign -, or dash.
The s-/-- removes the first / from the string.
The s-/\Z-- matches and removes a slash at the end of the line. I think this is better written: s{/$}{}.
$variable1 =~ s-/--; could be written as
$variable1 =~ s{/}{}xms;
or this
$variable1 =~ s/ \/ //xms;
It means delete the first / in the string.
Regarding s-/\Z--, it is usually written like this
$variable =~ s{/ \Z}{}xms;
or this
$variable =~ s/ \/ \Z //xms;
It means delete a / if it is at the end of the string (\Z).