Multi platform script perl or awk - regex

I am trying to match records in the following format:
(-,username,domain1.co.uk)\
(-,username,domain2.co.uk)
Either awk or perl must be used. I am using Cygwin and wrote the following code, which works and matches both of the above entries:
awk 'BEGIN {musr="(-,username,[^)]+.co.uk)"} {if ($0~musr) print $0}' netgroup
But if I try to modify this regexp to be more specific the output is nothing:
1st: match record then last backslash and then match newline:
"(-,username,[^)]+.co.uk)\\$"
2nd: match newline immediately after record without backslash:
"(-,username,[^)]+.co.uk)$"
So I decided to rewrite script into perl, hoping that perl can deal with backslashes and end of line symbols. For this purpose I used a2p this way:
echo 'BEGIN {musr="(-,username,[^)]+.co.uk)"} {if ($0~musr) print $0}' | a2p.exe
#!/usr/bin/perl
eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
if $running_under_some_shell;
# this emulates #! processing on NIH machines.
# (remove #! line above if indigestible)
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
# process any FOO=bar switches
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
$musr = '(-,username,[^)]+.co.uk)';
while (<>) {
chomp; # strip record separator
if ($_ =~ $musr) {
print $_;
}
}
This generated Perl script also matches both entries. However, if I try to make it more specific, I get the following errors:
1st:
$musr = "(-,username,[^)]+.co.uk)\\";
Trailing \ in regex m/(-,username,[^)]+.co.uk)\/ at perlmatch.pl line 18, <> line 1.
2nd:
$musr = "(-,username,[^)]+.co.uk)$";
Final $ should be \$ or $name at perlmatch.pl line 14, within string
syntax error at perlmatch.pl line 14, near "= "(-,username,[^)]+.co.uk)$""
Execution of perlmatch.pl aborted due to compilation errors.
3rd:
$musr = "(-,username,[^)]+.co.uk)\$";
[the output is nothing]
What am I doing wrong? My question also touches on the fact that if somebody needs to use a script on several platforms (AIX, Solaris, Linux), then using Perl should be a better approach than dealing with (non-)GNU utilities, various (g|n)awk versions, etc. Regards

Your problems arise from string quoting in Perl.
$musr = "(-,username,[^)]+.co.uk)\\"; replaces \\ with a single backslash when the string is created. But you would need to pass two backslashes to the regex. So you would have to put four in when you create the string.
$musr = "(-,username,[^)]+.co.uk)$"; tries to perform variable interpolation within the string.
In addition, parentheses should be escaped, as John Kugelman noted.
The solution is to use Perl's built-in delimiters for regular expressions, rather than normal quoted strings. The simple way is to put it right into your loop:
while (<>) {
chomp; # strip record separator
if ($_ =~ /\(-,username,[^)]+.co.uk\)$/) {
print $_;
}
}
If you do need to put the pattern into a variable first, use the special qr// operator.
my $musr = qr/\(-,username,[^)]+.co.uk\)$/;
while (<>) {
chomp; # strip record separator
if ($_ =~ $musr) {
print $_;
}
}
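To see why the quoted-string versions fail, it can help to look at what the regex engine actually receives. The following is only a minimal sketch: the two sample records are taken from the question, the variable names are illustrative, and the dots are escaped here as suggested in the other answers.

```perl
use strict;
use warnings;

# Sample records from the question (the first one ends in a backslash).
my @records = ('(-,username,domain1.co.uk)\\', '(-,username,domain2.co.uk)');

# In a double-quoted string, "\\" collapses to ONE backslash before the
# regex engine ever sees it; four are needed to deliver two.
my $two  = "x\\";
my $four = "x\\\\";
print length($two), " ", length($four), "\n";

# qr// avoids that extra layer of string escaping entirely.
my $with_bs    = qr/\(-,username,[^)]+\.co\.uk\)\\$/;  # record plus trailing backslash
my $without_bs = qr/\(-,username,[^)]+\.co\.uk\)$/;    # record at the very end of the line

my @continued = grep { $_ =~ $with_bs } @records;
my @final     = grep { $_ =~ $without_bs } @records;
print "continued: @continued\n";
print "final:     @final\n";
```

With qr//, the pattern is written exactly as the regex engine should see it, so only regex-level escaping is needed.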

(-,username,[^)]+.co.uk)\\$
The problem here is not with the backslash at the end of the line, it's the parentheses. Parentheses are used for grouping. You need to escape them to match literal ( ) characters. You should also escape the dots so they match literal dots instead of "any character".
$ awk '/\(-,username,[^)]+\.co\.uk\)$/ {print}' netgroup
(-,username,domain2.co.uk)
$ awk '/\(-,username,[^)]+\.co\.uk\)\\$/ {print}' netgroup
(-,username,domain1.co.uk)\
If you stick with plain awk and don't use [gn]awk-specific features awk is very portable. More portable than perl is, I would think.

Parentheses must be escaped. Otherwise they group expressions. To be more specific, match an optional backslash at the end of the line (Backslashes are doubled because as string they must be escaped too).
awk 'BEGIN {musr="\\(-,username,[^)]+.co.uk\\)\\\\?$"} {if ($0~musr) print $0}' netgroup

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?
A linefeed is matched by \n
echo "a\n\nb" | perl -pe's/\n/z/'
This prints azzbz, without a trailing newline, so the next prompt appears on the same line. Note that the program is fed one line at a time, so there is no need for the /g modifier. (This is also why \n\n doesn't match.) The /m modifier is then unrelated to this example.†
I don't know in what form this is used but I'd imagine not with echo feeding the input? Then better test it with input in a file, or in a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two


last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match beginning and end of lines inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will replace newlines inside the string. However, this example bears some complexities since some of the newlines stay.
You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'
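The difference between line mode and slurp mode can be reproduced inside a script. This is a sketch under the assumption that an in-memory filehandle is an acceptable stand-in for the real input; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $input = "aaa\n\n\nbbb\n";

# Line-at-a-time (what plain -p does): each chunk holds at most one "\n".
open my $by_line, '<', \$input or die "cannot open in-memory file: $!";
my $line_mode = '';
while (my $line = <$by_line>) {
    $line =~ s/\n\n/z/g;    # can never match within a single line
    $line_mode .= $line;
}
close $by_line;

# Slurp mode (what -0777 does): with $/ undefined, one read returns everything.
open my $whole, '<', \$input or die "cannot open in-memory file: $!";
my $slurp_mode = do { local $/; <$whole> };
close $whole;
$slurp_mode =~ s/\n\n/z/g;

print $line_mode eq $input ? "line mode: unchanged\n" : "line mode: changed\n";
print "slurp mode: $slurp_mode";
```

Note that three consecutive newlines leave one newline behind, because the first two are consumed as a pair.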

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where one or more words are enclosed in tags. In those instances where there is more than one word (the words are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually, so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...
This needs a mini nested-parser and I'd recommend a script, as easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in the @ARGV global array in the script. For the use indicated in the question's "Edit", replace the hardcoded
my $str = q(...);
with
my $str = shift @ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
which in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV, drop the -p switch, and add a print:
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.
The shift takes an element off of an array's front and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).
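If the substitution is needed in more than one place, it can be wrapped in a sub. The sketch below uses a hypothetical helper name, wrap_words, and the sample string from the question; it is the same nested substitution as above, not a general markup parser.

```perl
use strict;
use warnings;

# Hypothetical helper: give every word inside a tagged span its own pair of tags.
sub wrap_words {
    my ($line) = @_;
    $line =~ s{(<[^>]+>) (.+?) (</[^>]+>)}{
        my ($open, $text, $close) = ($1, $2, $3);   # copy before the inner s/// resets $1
        $text =~ s/(\w+)/$open$1$close/g;           # wrap each word individually
        $text;                                      # value of the block is the replacement
    }gex;
    return $line;
}

my $sample = 'word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>';
my $result = wrap_words($sample);
print "$result\n";
```

To process a file line by line instead of a single string, the same sub could be called as print wrap_words($_) while <>; with the file name given on the command line.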
This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
The s/ at the beginning is a substitution, so the command would be something like:
perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' in > out
The -p switch prints $_ after each line is processed, so nothing needs to follow the substitution; its return value (the number of substitutions made) is simply discarded.
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
$e = $'; # ' (stop markup highlighter)
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.

regular expression that matches any word that starts with pre and ends in al

The following regular expression gives me proper results when tried in Notepad++ editor but when tried with the below perl program I get wrong results. Right answer and explanation please.
The link to file I used for testing my pattern is as follows:
(http://sainikhil.me/stackoverflow/dictionaryWords.txt)
Regular expression: ^Pre(.*)al(\s*)$
Perl program:
use strict;
use warnings;
sub print_matches {
my $pattern = "^Pre(.*)al(\s*)\$";
my $file = shift;
open my $fp, $file;
while(my $line = <$fp>) {
if($line =~ m/$pattern/) {
print $line;
}
}
}
print_matches @ARGV;
A few thoughts:
You should not escape the dollar sign
The capturing group around the whitespaces is useless
Same for the capturing group around the dot .
which leads to:
^Pre.*al\s*$
If you don't want words like precious final to match (because of the whitespace in the middle), change the regex to:
^Pre\S*al\s*$
Included in your code:
while(my $line = <$fp>) {
if($line =~ /^Pre\S*al\s*$/m) {
print $line;
}
}
You're getting messed up by assigning the pattern to a variable before using it as a regex and putting it in a double-quoted string when you do so.
This is why you need to escape the $, because, in a double-quoted string, a bare $ indicates that you want to interpolate the value of a variable. (e.g., my $str = "foo$bar";)
The reason this is causing you a problem is because the backslash in \s is treated as escaping the s - which gives you just plain s:
$ perl -E 'say "^Pre(.*)al(\s*)\$";'
^Pre(.*)al(s*)$
As a result, when you go to execute the regex, it's looking for zero or more ses rather than zero or more whitespace characters.
The most direct fix for this would be to escape the backslash:
$ perl -E 'say "^Pre(.*)al(\\s*)\$";'
^Pre(.*)al(\s*)$
A better fix would be to use single quotes instead of double quotes and don't escape the $:
$ perl -E "say '^Pre(.*)al(\s*)$';"
^Pre(.*)al(\s*)$
The best fix would be to use the qr (quote regex) operator instead of single or double quotes, although that makes it a little less human-readable if you print it out later to verify the content of the regex (which I assume to be why you're putting it into a variable in the first place):
$ perl -E "say qr/^Pre(.*)al(\s*)$/;"
(?^u:^Pre(.*)al(\s*)$)
Or, of course, just don't put it into a variable at all and do your matching with
if($line =~ m/^Pre(.*)al(\s*)$/) ...
Try removing trailing newline character(s):
while(my $line = <$fp>) {
$line =~ s/[\r\n]+$//s;
And, to match only words that begin with Pre and end with al, try this regular expression:
/^Pre\w*al$/
(\w means any letter of a word, not just any character)
And, if you want to match both Pre and pre, do a case-insensitive match:
/^Pre\w*al$/i
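Putting the pieces together, a minimal sketch with a precompiled qr// pattern might look like the following. The word list is invented for illustration and only stands in for lines read from the dictionary file.

```perl
use strict;
use warnings;

# Compiled once with qr//, so no string-level escaping is involved.
my $pattern = qr/^Pre\w*al$/i;

# Made-up sample lines, including a CRLF-terminated one.
my @lines = ("Prenatal\r\n", "predial\n", "Precious\n", "precious final\n", "apparel\n");

my @matches;
for my $line (@lines) {
    (my $word = $line) =~ s/[\r\n]+$//;    # strip any trailing CR/LF
    push @matches, $word if $word =~ $pattern;
}
print "$_\n" for @matches;
```

Only the entries that start with pre (in any case), contain nothing but word characters, and end in al are kept.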

Matching string in perl which is parsed with awk

I did the following with awk:
awk '/secon/ {print $1}' vladtest.sh |cut -c2-6
What is the appropriate way to match this string in Perl?
my $var1= `awk '/secon/ {print $1}' vladtest.sh |cut -c2-6`
if ($var1 eq "secon"){
print "OK";
} else {
print "FALSE"
}
First of all, your problem is the fact that the result of `...` includes the newline. So $var1 is not "secon", but "secon\n". You could deal with that any number of ways - wrapping chomp(...) around the whole assignment to $var1, or changing the right side of the eq to include the "\n", or using a regex instead: if ($var1 =~ /^secon$/) works with or without the chomp, because $ matches at a string-final newline if present.
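The three comparisons can be checked without running any external command. In this sketch the string "secon\n" merely stands in for what the backticks would return.

```perl
use strict;
use warnings;

# Stand-in for the backtick result: command output keeps its trailing newline.
my $raw = "secon\n";

my $eq_raw   = ($raw eq "secon")   ? 1 : 0;   # the newline gets in the way
my $re_match = ($raw =~ /^secon$/) ? 1 : 0;   # $ also matches before a final newline

chomp(my $clean = $raw);                      # strip the newline from a copy
my $eq_clean = ($clean eq "secon") ? 1 : 0;

print "eq raw: $eq_raw, regex: $re_match, eq after chomp: $eq_clean\n";
```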
Second of all, you're using about two programs too many here:
my $var1= `awk '/secon/ {print $1}' vladtest.sh |cut -c2-6`
I mean, awk can do anything cut can do, and perl can do anything either can do. When I'm typing one-liners at the shell prompt, I'm a big fan of awk and cut, simply for their economy of expression; within their respective specialities, their command lines are generally shorter than the Perl equivalent (even using perl -a etc.). But since you're already in a Perl program, there's no reason to shell out to something else here.
Here's an example drop-in replacement for the above:
my $var1;
{
open(my $fh, '<', 'vladtest.sh');
while (<$fh>) {
$var1 .= substr((split(' '))[0],1,5)."\n" if /secon/;
}
}
But you don't need to go through all that if you just want to detect if there's a match.
{open(my $fh, '<', 'vladtest.sh');
print 'NOT ' unless grep { /^\s*\Ssecon/ } <$fh>;
say 'OK';} # or print "OK\n"; if you don't have 5.10 extensions enabled.
My awk's a bit rusty, but you're grabbing $1 then looking for characters 2 through 6 to match 'secon'. If any line matches this, then print "OK", otherwise "NOT OK". Here's a perl only solution:
use 5.10.0;
while (<>) {
if (/^\s*\Ssecon/) {
say 'OK';
exit 0;
}
}
say 'NOT OK';
exit 1;
The regexp is made up of:
^ match the start of each line, followed by
\s* zero or more whitespace characters, followed by
\S one non-whitespace character, followed by
secon the literal string that you're interested in
As soon as we have a line that matches, we can print 'OK' and exit. Otherwise we'll fall through the end of the loop and print 'NOT OK'.
The use 5.10.0; line is needed so you can use say().
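There are better ways to implement this, but to stay with your current approach, change the first line. Backticks interpolate like double-quoted strings, so the awk $1 must be escaped from Perl, and chomp removes the trailing newline so that the eq comparison can succeed:
chomp(my $var1 = `awk '\$1~/secon/ {print substr(\$1,2,5)}' vladtest.sh`);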

How to make a perl one-liner "line-endings agnostic"

I scratched my head for an hour over a Perl one-liner that failed because the file had CRLF line endings. The regex has a capture group at the end of the line, and the CR got included in the match, which garbled the output when the backreference was used in the replacement.
I ended up specifying the CRLF manually in the regex, but is there a way to get Perl to handle line endings automatically, whatever they are?
Original command is
perl -pe 's/foo bar(.*)$/foo $1 bar/g' file.txt
"Correct" command is
perl -pe 's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt
I know I can also convert line endings before processing, I'm interested in how to get Perl handle this case gracefully.
Example file (save with CRLF line endings!)
[19:06:57.033] foo barmy
[19:06:57.033] foo baryour
Expected output
[19:06:57.033] foo my bar
[19:06:57.033] foo your bar
Output with original command (bar goes at line beginning because it's matched together with carriage return):
bar:06:57.033] foo my
bar:06:57.033] foo your
First of all, let's keep in mind that
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
is short for something close to
perl -e'
while (<>) {
chomp;
s/foo bar(.*)\z/foo $1 bar/g;
print $_, $/;
}
' file.txt
Perl makes it so code can read/write local text files in a platform independent manner.
In a comment, you asked how to read/write both local text files and foreign text files in a platform independent manner.
First, you'll have to disable Perl's normal handling.
binmode STDIN;
binmode STDOUT;
Then you'll have to handle the multiple line endings.
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
So instead of
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
you would have
perl -e'
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
binmode STDIN;
binmode STDOUT;
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
' <file
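The effect of mychomp can be tried on in-memory strings before wiring it to STDIN. The sketch below uses made-up sample lines with three different endings; note that the helper strips (and returns) all trailing whitespace, not only CR and LF.

```perl
use strict;
use warnings;

# Strips trailing whitespace (including any CR/LF) and returns what was removed.
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }

my @out;
for my $line ("foo barmy\r\n", "foo baryour\n", "foo barlast") {
    my $copy = $line;                          # work on a copy of the sample
    my $le   = mychomp($copy);                 # remember the original line ending
    $copy =~ s/foo bar(.*)\z/foo $1 bar/;
    push @out, $copy . $le;                    # put the same ending back
}

# Show control characters visibly.
for my $o (@out) {
    (my $shown = $o) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    print "$shown\n";
}
```

Each line keeps whatever ending it arrived with, which is the point of capturing it instead of hardcoding \r\n.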
In newer perls, you can use \R in your regex to strip off all end-of-line characters (it includes both \n and \r). See perldoc perlre.
The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the documentation online), which matches "generic newlines" platform-agnostically, can be made to work here (the example uses Bash's printf to create the multi-line input string):
$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar
Note that the only difference to Ether's answer is use of a non-greedy construct (.*? rather than just .*), which makes all the difference here.
Read on, if you want to know more.
Background:
It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters - either \r\n or, typically, \n:[1]
With the greedy (.*) construct , "my\r" - including the \r - is captured, because the regex engine apparently only backtracks by one character to look for \R, which the remaining \n by itself also satisfies.
By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.
[1] \R matches MORE than just \r\n and \n: it also matches any single character that is classified as vertical whitespace in Unicode terms, which includes \x0B (vertical tab), \f (form feed), \r (by itself), and the following Unicode chars: U+0085 (NEXT LINE), U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR)
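The greedy versus non-greedy difference described above can be observed directly on a single CRLF-terminated string. This is a small sketch; the helper visible is hypothetical and only makes the control characters readable.

```perl
use strict;
use warnings;

my $crlf_line = "foo barmy\r\n";

# Greedy: (.*) swallows the \r, and \R is satisfied by the remaining \n.
(my $greedy = $crlf_line) =~ s/foo bar(.*)\R/foo $1 bar\n/;

# Non-greedy: (.*?) stops before the \r, so \R consumes the whole \r\n pair.
(my $lazy = $crlf_line) =~ s/foo bar(.*?)\R/foo $1 bar\n/;

# Hypothetical helper to show control characters visibly.
sub visible { my ($s) = @_; $s =~ s/\r/\\r/g; $s =~ s/\n/\\n/g; return $s }

print "greedy: ", visible($greedy), "\n";
print "lazy:   ", visible($lazy), "\n";
```

In the greedy case the stray \r ends up between the captured text and " bar", which is what sends the cursor back to the start of the line on a terminal.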
You can say:
perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt
The line endings would be preserved, i.e. would be the same as the input file.
You might also want to refer to perldoc perlport.
is there a way to get perl handle automatically platform-specific line-ending?
Yes. It's actually the default.
The issue is that you're trying to handle Windows line endings on a unix platform.
This will definitely do it:
perl -pe'
BEGIN {
binmode STDIN, ":crlf";
binmode STDOUT, ":crlf";
}
s/foo bar(.*)$/foo $1 bar/g;
' <file.txt
Might I suggest you keep doing it manually?
Alternatively, you could convert the file to a text file and convert it back.
<file.orig dos2unix | perl -pe'...' | unix2dos >file.new