why doesn't this Perl capture work - regex

I expected this to capture and print just the group defined in parens, but instead it prints the whole line. How can I capture and print just the group in parens?
echo "abcdef" | perl -ne "print $1 if /(cd)/ "
What I want this to print: cd
What it actually prints: abcdef
How to fix?

In the perl command, you have to use single quotes or escape the variable:
echo "abcdef" | perl -ne "print \$1 if /(cd)/"
or
echo "abcdef" | perl -ne 'print $1 if /(cd)/'
In double quotes, the shell expands $1 before perl ever sees it.

The quick fix is to change your double quotes to single quotes, like this:
$ echo abcdef | perl -ne 'print $1 if /(cd)/'
cd
With double quotes, the shell interprets your unprotected variable $1, which in your environment apparently evaluates to an empty string. So perl only receives the command print if /(cd)/, which is shorthand for print $_ if /(cd)/ and therefore prints the entire line.
You can also use a protected variable like this:
$ echo abcdef | perl -ne "print \$1 if /(cd)/"
cd
Note that matches using delimiters other than / must begin with the m keyword rather than the shorthand form. That does not matter in your case, but it is worth knowing when working with matches: for example, m|/| matches a / character, using the pipe as the delimiter for the regular expression.
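The quoting behaviour described in these answers can be seen without involving perl at all. The following is a minimal sketch: show_program is a hypothetical helper that only prints the argument it receives, and the sketch assumes the shell's own $1 is unset, as it apparently was in the question.

```shell
# show_program is a hypothetical helper: it prints its single argument
# between brackets so the exact text perl would receive is visible.
show_program() {
    printf '[%s]\n' "$1"
}

set --   # assumption: no positional parameters, so the shell's own $1 is empty

double=$(show_program "print $1 if /(cd)/")    # the shell expands $1 first
escaped=$(show_program "print \$1 if /(cd)/")  # the backslash protects the $
single=$(show_program 'print $1 if /(cd)/')    # single quotes: no expansion

printf '%s\n' "$double" "$escaped" "$single"
```

Only the first form loses the $1, which is why perl falls back to printing $_.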

Related

Replace Random Characters in a Variable

I want to replace value of a variable (can contain a number, a character, a string of characters).
$ echo $VAR
http://some-random-string.watch.film.tv/nfvsere/watch/skrz1j8exe/chunks.m3u8?nimblesessionid=30931574352........
So far, I've tried this command, however it's not working, so I'm thinking some of these might need a regex.
$ echo $VAR | sed -e "s/\(http[^^]*\).*\(.watch\)/\1"mystring"\2/g"
$ echo $VAR | sed -e "s/\(https\?:\/\/\).*\(.watch\)/\1"mystring"\2/g"
$ echo $VAR | sed -e "s/\(http[s]\?:\/\/\).*\(.watch\)/\1"mystring"\2/g"
I'm aware that there are questions that answer similar queries, but they have not been of help.
echo $VAR | sed 's|\(http[s]*://\)[^.]*\(.*\)|\1mystring\2|'
Explanation:
s| # substitute
\(http[s]*://\) # save first part in arg1 (\1)
[^.]* # all without '.'
\(.*\) # save the rest in arg2 (\2)
|\1 # print arg1
mystring # print your replacement
\2 # print arg2
|
In sed, when / is the delimiter of the s command, you have to escape every literal forward slash in the pattern and in the replacement:
echo $VAR | sed 's/http.\/\/.*\.watch\.film\.tv/http:\/\/mystring.watch.film.tv/'
You don't need sed. This task can be done in just shell:
$ var='http://some-random-string.watch.film.tv/nfvsere/watch/skrz1j8exe/chunks.m3u8?nimblesessionid=30931574352'
$ echo "${var%%//*}//mystring.watch${var#*.watch}"
http://mystring.watch.film.tv/nfvsere/watch/skrz1j8exe/chunks.m3u8?nimblesessionid=30931574352
How it works:
${var%%//*} returns the value of $var with the first // and everything after it removed.
//mystring.watch adds the string that we want.
${var#*.watch} returns the value of $var with everything up to and including the first occurrence of .watch removed.
Because this approach does not require pipelines or subshells, it will be more efficient.
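The same two expansions can be wrapped in a small function so the replacement text becomes a parameter. This is only a sketch: replace_label is a hypothetical name, and it assumes the URL contains both // and .watch.

```shell
# replace_label is a hypothetical helper name.
# $1 = URL, $2 = new text to put between "//" and the first ".watch".
replace_label() {
    # ${1%%//*}    : drop the first "//" and everything after it (keeps "http:")
    # ${1#*.watch} : drop everything up to and including the first ".watch"
    printf '%s\n' "${1%%//*}//${2}.watch${1#*.watch}"
}

replace_label 'http://some-random-string.watch.film.tv/nfvsere/watch/skrz1j8exe/chunks.m3u8' mystring
```

If the URL has no .watch in it, ${1#*.watch} matches nothing and returns the whole URL, so a real script would probably want to check for that case first.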
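With GNU sed (the quotes around mystring are dropped here, because inside single quotes they would be inserted literally):
$ echo $VAR | sed -E 's/^(http:\/\/)\S+(\.watch\.film\.tv\/)/\1mystring\2/i'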

Get substring using either perl or sed

I can't seem to get a substring correctly.
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g')
That still returns bugfix/US3280841-something-duh.
If I try to use perl instead:
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9]|[A-Z0-9])+/; print $1');
That outputs nothing.
What am I doing wrong?
Using bash parameter expansion only:
$: # don't use caps; see below.
$: declare branch="bugfix/US3280841-something-duh"
$: tmp="${branch##*/}"
$: echo "$tmp"
US3280841-something-duh
$: trimmed="${tmp%%-*}"
$: echo "$trimmed"
US3280841
Which means:
$: tmp="${branch##*/}"
$: trimmed="${tmp%%-*}"
does the job in two steps without spawning extra processes.
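As a sketch, the two steps can be put in a function; ticket_id is a hypothetical name, and the extra branch names below are made-up inputs used only to show the edge cases.

```shell
# ticket_id is a hypothetical helper name.
# $1 = branch name such as "bugfix/US3280841-something-duh".
ticket_id() {
    tmp=${1##*/}                 # strip everything through the last "/"
    printf '%s\n' "${tmp%%-*}"   # strip the first "-" and everything after it
}

ticket_id 'bugfix/US3280841-something-duh'
ticket_id 'feature/team/AB12-x'   # nested prefix: only the last segment is used
ticket_id 'hotfix'                # no "/" or "-": returned unchanged
```

Both patterns simply fail to match when the slash or dash is missing, so the input passes through untouched rather than producing an error.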
In sed,
$: sed -E 's#^.*/([^/-]+)-.*$#\1#' <<< "$branch"
This says "after any or no characters followed by a slash, remember one or more that are not slashes or dashes, followed by a not-remembered dash and then any or no characters, then replace the whole input with the remembered part."
Your original pattern was
's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g'
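Without -E, a sed that implements only POSIX basic regular expressions may read this as "remember any number of anything, then match a slash, then a lowercase letter or a digit, then a literal pipe character, then a capital letter or digit, then a literal plus sign, and then replace it all with what you remembered." Alternation with | and repetition with + are extended-regex operators; GNU sed also accepts \| and \+ in basic mode, but other seds may not.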
GNU's manual is your friend. I look stuff up all the time to make sure I'm doing it right. Sometimes it still takes me a few tries, lol.
An aside - try not to use all-capital variable names. By convention those are used for environment variables and for variables that are special to the shell, like RANDOM or IFS.
You may use this sed:
sed -E 's~^.*/|-.*$~~g' <<< "$BRANCH_NAME"
US3280841
Or this awk:
awk -F '[/-]' '{print $2}' <<< "$BRANCH_NAME"
US3280841
sed 's:[^/]*/\([^-]*\)-.*:\1:'<<<"bugfix/US3280841-something-duh"
The Perl version just has the + in the wrong place. It should be inside the capturing parentheses:
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9A-Z]+)/; print $1');
Just use a ^ before A-Z0-9
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[^A-Z0-9]\+/\1/g')
in your sed case.
Alternatively and briefly, you can use
TRIMMED=$(echo $BRANCH_NAME | sed "s/[a-z\/\-]//g" )
too.
Type this in a shell terminal:
$ BRANCH_NAME="bugfix/US3280841-something-duh"
$ echo $BRANCH_NAME| perl -pe 's/.*\/(\w\w[0-9]+).+/\1/'
Use the s (substitute) command instead of m (match).
Perl's regex syntax is largely a superset of sed's, so the same expression also works with sed -E (GNU sed, which understands \w) in place of perl -pe.
Another variant using Perl Regular Expression Character Classes (see perldoc perlrecharclass).
echo $BRANCH_NAME | perl -nE 'say m/^.*\/([[:alnum:]]+)/;'

Parsing Karma Coverage Output in Bash for a Jenkins Job (Scripting)

I'm working with the following output:
=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================
I would like to parse the 4 coverage percentage numbers without the percent via REGEX, into a comma separated list.
Any suggestions for a good regex expression for this? Or another good option?
The sed command:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;p;}' input.txt | sed ':a;N;$!ba;s/\n/,/g'
gives the output:
26.16,6.89,23.82,26.17
Edit: A better answer, with only a single sed, would be:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt
Explanation:
/ .*% / search for lines with a percentage value (note spaces)
s/.* \(.*\)% .*/\1/ and delete everything except the percentage value
H and then append it to the hold space, prefixed with a newline
$ then for the last line
g get the hold space
s/\n/,/g replace all the newlines with commas
s/,// and delete the initial comma
p and then finally output the result
To harden the regex, you could replace the search for the percentage value .*% with for example [0-9.]*%.
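A sketch of the hardened variant, with the sample summary embedded so it can be run as is. The function name coverage_csv_sed is made up for illustration, and GNU sed is assumed (for \n in the pattern and the one-line ;} syntax).

```shell
# Sample input copied from the question.
summary='=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================'

# coverage_csv_sed is a hypothetical helper name; it reads the summary on stdin.
coverage_csv_sed() {
    # Same script as above, but only digits and dots may precede the % sign.
    sed -n '/ [0-9.]*% /{s/.* \([0-9.]*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}'
}

printf '%s\n' "$summary" | coverage_csv_sed
```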
I think this is a grep job. This should help:
$ grep -oE "[0-9]{1,2}\.[0-9]{2}" input.txt | xargs | tr " " ","
Output:
26.16,6.89,23.82,26.17
The input file just contains what you have shown above. Obviously, there are other ways like cat to feed the input to the command.
Explanation:
grep -oE: only show matches using extended regex
xargs: put all results onto a single line
tr " " ",": translate the spaces into commas
This is actually a nice shell tool belt example, I would say.
Including the consideration of Joseph Quinsey, the regex can be made more robust with a lookahead to assert a % sign after the numeric value, using a Perl-compatible RE pattern:
grep -oP "[0-9]{1,2}\.[0-9]{2}(?=%)" input.txt | xargs | tr " " ","
Would you consider using awk? Here's a command you may try:
$ awk 'match($0,/[0-9.]*%/){s=(s=="")?"":s",";s=s substr($0,RSTART,RLENGTH-1)}END{print s}' file
26.16,6.89,23.82,26.17
Brief explanation,
match($0,/[0-9.]*%/): find the record matched with regex [0-9.]*%
s=(s=="")?"":s",": since a comma-separated list is required, append a comma to s before every match except the first one.
s=s substr($0,RSTART,RLENGTH-1): append the matched part, minus the trailing %, to s
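The same logic, unrolled into a commented function, might look like the sketch below. The name coverage_csv_awk is made up, and the pattern uses + instead of * (an assumption on my part) so that a bare % sign cannot match.

```shell
# coverage_csv_awk is a hypothetical helper name; it reads the summary on stdin.
coverage_csv_awk() {
    awk '
        match($0, /[0-9.]+%/) {
            value = substr($0, RSTART, RLENGTH - 1)         # drop the trailing %
            csv = (csv == "") ? value : csv "," value       # comma before all but the first
        }
        END { print csv }
    '
}

printf '%s\n' 'Statements : 26.16% ( 1681/6425 )' \
              'Branches : 6.89% ( 119/1727 )' \
              'Functions : 23.82% ( 390/1637 )' \
              'Lines : 26.17% ( 1680/6420 )' | coverage_csv_awk
```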
Assuming the item names (Statements, Branches, ...) do not contain whitespaces, how about:
#!/bin/bash
declare -a keys
declare -a values
while read -r line; do
if [[ "$line" =~ ^([^\ ]+)\ *:\ *([0-9.]+)% ]]; then
keys+=("${BASH_REMATCH[1]}")
values+=("${BASH_REMATCH[2]}")
fi
done < output.txt
ifsback=$IFS # backup IFS
IFS=,
echo "${keys[*]}"
echo "${values[*]}"
IFS=$ifsback # restore IFS
which yields:
Statements,Branches,Functions,Lines
26.16,6.89,23.82,26.17
Yet another option, with perl:
cat the_file | perl -e 'while(<>){/(\d+\.\d+)%/ and $x.="$1,"}chop $x; print $x;'
The code, unrolled and explained:
while(<>){ # Read line by line. Put lines into $_
/(\d+\.\d+)%/ and $x.="$1,"
# Equivalent to:
# if ($_ =~ /(\d+\.\d+)%/) {$x.="$1,"}
# The regex matches "numbers", "dot", "numbers" and "%",
# stores just numbers on $1 (first capturing group)
}
chop $x; # Remove extra ',' and print result
print $x;
Somewhat shorter with an extra sed
cat the_file | perl -ne '/(\d+\.\d+)%/ and print "$1,"'|sed 's/.$//'
This uses the -n switch, which wraps the code in an implicit while(<>){} loop. sed then removes the trailing ','.

search and replace regexp gives two different outputs if grouping metacharacters are used

I get different outputs from search and replace regexp in perl depending whether I use in place replace (sed alternative) or regular search replace and also depending on whether I use \1 or $1:
──> cat test1.txt
orig.avg.10
──> cat test2.txt
orig.avg.10
# EXPECTED
──> cat test1.txt | perl -lne '$_ =~ s/(avg\.[0-9]+)/$1\.vec/; print $_'
orig.avg.10.vec
# EXPECTED
──> cat test1.txt | perl -lne '$_ =~ s/(avg\.[0-9]+)/\1\.vec/; print $_'
orig.avg.10.vec
# EXPECTED
──> perl -p -i.bak -e "s/(avg\.[0-9]+)/\1\.vec/" test2.txt
──> cat test2.txt
orig.avg.10.vec
# UNEXPECTED
──> perl -p -i.bak -e "s/(avg\.[0-9]+)/$1\.vec/" test1.txt
──> cat test1.txt
orig..vec
Why does this happen?
You are using " to wrap your perl code, but doing so means the shell can and will interpolate $1.
Use ' instead and everything will work as expected.
The problem is that in the # UNEXPECTED case I used double quotes, which let the shell expand the $1 variable. Sometimes you need to write the question down before you see the cause.

Replace a string in bash script using regex for both linux and AIX

I have a bash script that I copy and run on both linux and AIX servers.
This script gets a "name" parameter which represents a file name, and I need to manipulate this name via regex (the purpose is irrelevant and very hard to explain).
From the name parameter I need to take the beginning until the first "-" character that is followed by a digit, and then concat it with the last "." character until the end of the string.
For example:
name: abcd-efg-1.23.4567-8.jar will become: abcd-efg.jar
name: abc123-abc3.jar will remain: abc123-abc3.jar
name: abc-890.jar will become: abc.jar
I've tried several variations of:
name=$1
regExpr="^(.*?)-\d.*\.(.*?)$/g"
echo $name
echo $(printf ${name} | sed -e $regExpr)
Also I can't use sed -r (seen in some examples) because AIX sed does not support the -r flag.
The last line is the problem of course; I think I need to somehow use $1 + $2 placeholders, but I can't seem to get it right.
How can I change my regex so that it does what I want?
Given the file:
abcd-efg-1.23.4567-8.jar
abc123-abc3.jar
abc-890.jar
This is a way to change the names you give:
$ sed 's/\(.\?\)-[0-9].*\(\.[a-z]*\)$/\1\2/' file
abcd-efg.jar
abc123-abc3.jar
abc.jar
Which is equivalent to (if you could use -r):
$ sed -r 's/(.?)-[0-9].*(\.[a-z]*)$/\1\2/' file
abcd-efg.jar
abc123-abc3.jar
abc.jar
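The group \(.\?\) captures at most the single character just before the first - + digit and "stores" it in \1; everything earlier in the line is outside the match, so it is left as is.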
It gets from last . + letters and "stores" in \2.
Finally it prints those blocks back.
Note the extension could also be fetched with the basename utility or with something like "${line##*.}".
You could use this:
perl -F'(-(?:\d)|\.)' -ane 'print "$F[0].$F[$#F]"'
It splits the input on any - followed by a digit, or on any . character. Then it prints the first field, followed by a dot, followed by the last field.
Testing it out:
$ cat file
abcd-efg-1.23.4567-8.jar
abc123-abc3.jar
abc-890.jar
$ perl -F'(-(?:\d)|\.)' -ane 'print "$F[0].$F[$#F]"' file
abcd-efg.jar
abc123-abc3.jar
abc.jar
In sed, you could simply use the following.
#!/bin/sh
STRING=$( cat <<EOF
abcd-efg-1.23.4567-8.jar
abc123-abc3.jar
abc-890.jar
EOF
)
echo "$STRING" | sed 's/-[0-9].*\(\.[^.]\+\)$/\1/'
# abcd-efg.jar
# abc123-abc3.jar
# abc.jar
This matches a hyphen followed by a digit and everything after it, and replaces the match with the file extension.
Or you may consider using a Perl one-liner:
echo "$STRING" | perl -pe 's/-\d.*(?=\.[^.]+$)//'
# abcd-efg.jar
# abc123-abc3.jar
# abc.jar
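The sed commands above rely on \? or \+, which are GNU extensions to basic regular expressions. The sketch below uses only POSIX basic-regex constructs, which should be closer to what a non-GNU sed accepts; it has not been verified on AIX, and strip_version is a hypothetical name.

```shell
# strip_version is a hypothetical helper name; it reads file names on stdin.
strip_version() {
    # -[0-9].*        : the first hyphen followed by a digit, and what follows
    # \(\.[^.]*\)$    : the last dot plus the extension, kept as \1
    sed 's/-[0-9].*\(\.[^.]*\)$/\1/'
}

printf '%s\n' abcd-efg-1.23.4567-8.jar abc123-abc3.jar abc-890.jar | strip_version
```

Names with no hyphen-digit pair, such as abc123-abc3.jar, do not match and pass through unchanged.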
When a regex match succeeds, perl captures whatever is matched inside parentheses ( .. ) as $1, $2, etc.
$ perl -e 'my $arg = $ARGV[0]; $arg =~ /^(.*?)-\d.*\.(.*?)$/; print "$1.$2\n"; ' abc-890.jar
abc.jar