Get substring using either perl or sed - regex

I can't seem to get a substring correctly.
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g')
That still returns bugfix/US3280841-something-duh.
If I try to use perl instead:
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9]|[A-Z0-9])+/; print $1');
That outputs nothing.
What am I doing wrong?

Using bash parameter expansion only:
$: # don't use caps; see below.
$: declare branch="bugfix/US3280841-something-duh"
$: tmp="${branch##*/}"
$: echo "$tmp"
US3280841-something-duh
$: trimmed="${tmp%%-*}"
$: echo "$trimmed"
US3280841
Which means:
$: tmp="${branch##*/}"
$: trimmed="${tmp%%-*}"
does the job in two steps without spawning extra processes.
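The two expansions can be wrapped in a small function if the same trimming is needed in several places. This is only a sketch: the function name ticket_id is made up for illustration, and it assumes the ID is whatever sits between the last slash and the first dash after it.

```shell
# Hypothetical helper: pull the ticket ID out of a branch name such as
# "bugfix/US3280841-something-duh" using parameter expansion only.
ticket_id() {
  local branch tmp
  branch=$1
  tmp=${branch##*/}            # drop everything up to and including the last "/"
  printf '%s\n' "${tmp%%-*}"   # drop everything from the first "-" onward
}

ticket_id "bugfix/US3280841-something-duh"
ticket_id "feature/team/US42"
```

A name without a slash or without a dash passes through the corresponding step unchanged, so the function does not fail on short branch names.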
In sed,
$: sed -E 's#^.*/([^/-]+)-.*$#\1#' <<< "$branch"
This says "after any or no characters followed by a slash, remember one or more that are not slashes or dashes, followed by a not-remembered dash and then any or no characters, then replace the whole input with the remembered part."
Your original pattern was
's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g'
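In a sed without the GNU extensions, where \| and \+ are not operators in basic regular expressions, this says "remember any number of anything followed by a slash, then a lowercase letter or a digit, then a pipe character (because alternation only works with -E), then a capital letter or digit, then a literal plus sign, and then replace it all with what you remembered." GNU sed does treat \| as alternation and \+ as "one or more" in basic mode, so there the pattern means something different again, but either way it does not isolate the ID.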
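To sidestep the differences between basic and extended syntax, the working substitution can be written both ways. A minimal sketch, assuming the same sample branch name; the variable names bre and ere are only for illustration:

```shell
branch="bugfix/US3280841-something-duh"

# Basic regular expression: groups are \( \) and "one or more" is \{1,\}.
bre=$(printf '%s\n' "$branch" | sed 's#^.*/\([^/-]\{1,\}\)-.*$#\1#')

# Extended regular expression (-E): plain ( ) and + do the same job.
ere=$(printf '%s\n' "$branch" | sed -E 's#^.*/([^/-]+)-.*$#\1#')

printf 'BRE: %s\nERE: %s\n' "$bre" "$ere"
```

Both forms should give the same result; the basic form avoids relying on \+ and \|, which are extensions rather than part of POSIX basic syntax.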
GNU's manual is your friend. I look stuff up all the time to make sure I'm doing it right. Sometimes it still takes me a few tries, lol.
An aside - try not to use all-capital variable names. That is a convention that indicates it's special to the OS, like RANDOM or IFS.

You may use this sed:
sed -E 's~^.*/|-.*$~~g' <<< "$BRANCH_NAME"
US3280841
Or this awk:
awk -F '[/-]' '{print $2}' <<< "$BRANCH_NAME"
US3280841

sed 's:[^/]*/\([^-]*\)-.*:\1:'<<<"bugfix/US3280841-something-duh"

The Perl version just has the + in the wrong place. It should be inside the capturing parentheses:
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9A-Z]+)/; print $1');

Just use a ^ before A-Z0-9
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[^A-Z0-9]\+/\1/g')
in your sed case.
Alternatively and briefly, you can use
TRIMMED=$(echo $BRANCH_NAME | sed "s/[a-z\/\-]//g" )
too.

Type this in a shell terminal:
$ BRANCH_NAME="bugfix/US3280841-something-duh"
$ echo $BRANCH_NAME| perl -pe 's/.*\/(\w\w[0-9]+).+/\1/'
Use the s (substitute) command instead of m (match).
Perl's regex syntax covers everything this expression needs from sed, so with GNU sed the same command works with 'sed -E' in place of 'perl -pe'.

Another variant using Perl Regular Expression Character Classes (see perldoc perlrecharclass).
echo $BRANCH_NAME | perl -nE 'say m/^.*\/([[:alnum:]]+)/;'

Related

sed: struggling with substitution and regex for ^*=

I am running a Linux bash script. From stdout lines like: /gpx/trk/name=MyTrack1, I want to keep only the end of the line after =.
I am struggling to understand why the following sed command is not working as I expect:
echo "/gpx/trk/name=MyTrack1" | sed -e "s/^*=//"
(I also tried)
echo "/gpx/trk/name=MyTrack1" | sed -e "s/^*\=//"
The return is always /gpx/trk/name=MyTrack1 and not MyTrack1
An even simpler way if this is the only structure you are concerned about:
echo "/gpx/trk/name=MyTrack1" | cut -d = -f 2
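One caveat with cut is worth a quick check: -f 2 returns only the second field, so a value that itself contains = gets truncated. A short sketch with a made-up input line (the query=a=b sample is an assumption, not from the question):

```shell
line="/gpx/trk/query=a=b"

# Field 2 only: stops at the next "=".
second_only=$(printf '%s\n' "$line" | cut -d= -f2)

# Field 2 through the end of the line: keeps any later "=" characters.
to_end=$(printf '%s\n' "$line" | cut -d= -f2-)

printf '%s\n%s\n' "$second_only" "$to_end"
```

If the values can never contain =, the plain -f 2 form is enough.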
Simply try:
echo "/gpx/trk/name=MyTrack1" | sed 's/.*=//'
Solution 2nd: With another sed.
echo "/gpx/trk/name=MyTrack1" | sed 's/\(.*=\)\(.*\)/\2/'
Explanation, added at the OP's request:
s: Tells sed to perform a substitution.
\(.*=\): The first capture group. It matches everything from the start of the line up to and including the last =, so it holds /gpx/trk/name=.
\(.*\): The second capture group. It matches everything after the first group, that is, everything after the =, so it holds MyTrack1.
/\2/: Replaces the whole matched line with the contents of the second group, which is MyTrack1.
Solution 3rd: Or with awk, assuming your input has the same form as the samples shown.
echo "/gpx/trk/name=MyTrack1" | awk -F'=' '{print $2}'
Solution 4th: With awk's match.
echo "/gpx/trk/name=MyTrack1" | awk 'match($0,/=.*$/){print substr($0,RSTART+1,RLENGTH-1)}'
$ echo "/gpx/trk/name=MyTrack1" | sed -e "s/^.*=//"
MyTrack1
The regular expression ^.*= matches anything up to and including the last = in the string.
Your regular expression ^*= would match the literal string *= at the start of a string, e.g.
$ echo "*=/gpx/trk/name=MyTrack1" | sed -e "s/^*=//"
/gpx/trk/name=MyTrack1
The * character in a regular expression usually modifies the immediately previous expression so that zero or more of it may be matched. When * occurs at the start of an expression on the other hand, it matches the character *.
Not to take you off the sed track, but this is easy with Bash alone:
$ s="/gpx/trk/name=MyTrack1"
$ echo "$s"
/gpx/trk/name=MyTrack1
$ echo "${s##*=}"
MyTrack1
The ##*= expansion removes the longest match of the pattern, from the beginning of the string through the last =:
$ s="1=2=3=the rest"
$ echo "${s##*=}"
the rest
The equivalent in sed would be:
$ echo "$s" | sed -E 's/^.*=(.*)/\1/'
the rest
Whereas #*= removes the shortest match:
$ echo "${s#*=}"
2=3=the rest
And in sed:
$ echo "$s" | sed -E 's/^[^=]*=(.*)/\1/'
2=3=the rest
Note the difference between * in Bash parameter expansion patterns and * in a sed regex:
The * in Bash (in this context) is a glob: by itself it matches any string of characters
The * in a regex repeats the previous element, so for 'any characters' you need .*
Bash has extensive string manipulation functions. You can read about Bash string patterns in BashFAQ.
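The prefix and suffix forms can be combined to split a line into both halves. A minimal sketch, assuming a single key=value line as in the question; key and value are just illustrative variable names:

```shell
s="/gpx/trk/name=MyTrack1"

key=${s%%=*}     # longest suffix match of "=*" removed: text before the first "="
value=${s#*=}    # shortest prefix match of "*=" removed: text after the first "="

# The same split with sed regexes, where "any characters" has to be spelled ".*".
sed_key=$(printf '%s\n' "$s" | sed 's/=.*//')
sed_value=$(printf '%s\n' "$s" | sed 's/^[^=]*=//')

printf '%s -> %s\n' "$key" "$value"
```

The expansion forms run inside the shell itself, while each sed call starts a separate process.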

Regex EOL replace with perl is giving unexpected results

Why is there a dollar sign at the starting of line 2 and line 3?
➜ echo -e "hello\nworld" | perl -pe 's/$/\$/g'
hello$
$world$
$%
Above, I am trying to add a dollar sign at the end of each line, but somehow it's appending a dollar sign at the beginning too. It does that when global flag is enabled. But when I remove the global flag, it works fine:
➜ echo -e "hello\nworld" | perl -pe 's/$/\$/'
hello$
world$
Can anyone explain what's happening? Maybe it has something to do with '\r\n' characters?
EDIT : Adding the lookbehind case
It's not just breaking in this case, but in other cases as well. Consider the following:
➜ echo -e "A\nB\nC\nD" | perl -pe 's/(?<!A)$/\$/'
A
$B$
C$
D$
Above, I want to mark rows which don't end in "A" with $.
The extra dollar sign in line 2 shouldn't be there. I'm not even using global flag.
SOLUTION : Okay got it now. The solution for second one is like this (for explanation, refer to Wiktor Stribiżew's answer)
➜ echo -e "A\nB\nC\nD" | perl -pe 's/(?<!A|\n)$/\$/'
A
B$
C$
D$
But beware, if you try with more than a single character, it will throw "Variable length lookbehind not implemented in regex". For example:
➜ echo -e "AA\nBB\nCC\nDD" | perl -pe 's/(?<!AA|\n)$/\$/'
Variable length lookbehind not implemented in regex m/(?<!AA|\n)$/ at -e line 1.
To solve this, add the appropriate number of . before the newline, so that both alternatives in the lookbehind have the same length.
➜ echo -e "AA\nBB\nCC\nDD" | perl -pe 's/(?<!AA|.\n)$/\$/'
AA
BB$
CC$
DD$
The point is that $ is a zero-width assertion and it can match before a final newline. Perl reads a line with a trailing \n, so $ matches twice: before and after that.
Your string basically goes to Perl as two lines:
hello\n
world\n
And the $ can match both before a final newline and at the very end of the string. Thus, there are two matches in both lines ("strings" in this context).
If you want to match only the very end of the string, use \z:
perl -pe 's/\z/\$/g'
However, that is unlikely to be what you want. Since \z matches only after the trailing newline, the $ ends up at the start of the second and subsequent lines, and once more as a final line of its own.
To only insert $ before the last \n and stop, use your perl -pe 's/$/\$/', with no g modifier.
If you really want to use it with the global replace, you can use the following command:
echo -e "hello\nworld" | perl -pe 's/^(.*)$/$1\$/g'
hello$
world$
or without back-references you can use:
echo -e "hello\nworld" | perl -pe 's/\n$/\$\n/g'
hello$
world$
You might need to replace \n with \r\n if you are processing a file from Windows, or just run dos2unix first to remove the Windows \r characters.

sed regex with alternative on Solaris doesn't work

Currently I'm trying to use sed with regex on Solaris but it doesn't work.
I need to show only the lines matching my regex.
sed -n -E '/^[a-zA-Z0-9]*$|^a_[a-zA-Z0-9]*$/p'
input file:
grtad
a_pitr
_aupa
a__as
baman
12353
ai345
ki_ag
-MXx2
!!!23
+_)#*
I want to show only the lines matching the above regex:
grtad
a_pitr
baman
12353
ai345
Is there another way to use alternative? Is it possible in perl?
Thanks for any solutions.
With Perl
perl -ne 'print if /^(a_)?[a-zA-Z0-9]*$/' input.txt
The (a_)? matches a_ one-or-zero times, so optionally. It may or may not be there.
The (a_) also captures the match, which is not needed here, so you can use (?:a_)? instead. The ?: makes the parentheses only group what is inside (so that ? applies to the whole thing) without remembering it.
with grep
$ grep -xiE '(a_)?[a-z0-9]*' ip.txt
grtad
a_pitr
baman
12353
ai345
-x match whole line
-i ignore case
-E extended regex, if not available, use grep -xi '\(a_\)\?[a-z0-9]*'
(a_)? zero or one time match a_
[a-z0-9]* zero or more letters or digits
With sed
sed -nE '/^(a_)?[a-zA-Z0-9]*$/p' ip.txt
or, with GNU sed
sed -nE '/^(a_)?[a-z0-9]*$/Ip' ip.txt
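If the sed at hand does not accept -E at all, which is what the question suggests for Solaris, the alternation can be avoided by giving sed one plain pattern per alternative. This is a sketch under that assumption; the function name only_matching is invented for the example:

```shell
# Two basic-regex addresses instead of one extended regex with "|".
# A line cannot match both (the second requires an underscore, the first
# forbids one), so no line is printed twice.
only_matching() {
  sed -n -e '/^[a-zA-Z0-9]*$/p' -e '/^a_[a-zA-Z0-9]*$/p'
}

printf '%s\n' grtad a_pitr _aupa a__as baman 12353 ai345 ki_ag -MXx2 '!!!23' '+_)#*' |
  only_matching
```

Like the original regex, this also lets empty lines through, because [a-zA-Z0-9]* can match nothing.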

How to ignore word delimiters in sed

So I have a bash script which is working perfectly except for one issue with sed.
full=$(echo $full | sed -e 's/\b'$first'\b/ /' -e 's/ / /g')
This would work great except there are instances where the variable $first is preceded immediately by a period, not a blank space. In those instances, I do not want the variable removed.
Example:
full="apple.orange orange.banana apple.banana banana";first="banana"
full=$(echo $full | sed -e 's/\b'$first'\b/ /' -e 's/ / /g')
echo $first $full;
I want to only remove the whole word banana, and not make any change to orange.banana or apple.banana, so how can I get sed to ignore the dot as a delimiter?
You want "banana" that is preceded by beginning-of-string or a space, and followed by a space or end-of-string
$ sed -r 's/(^|[[:blank:]])'"$first"'([[:blank:]]|$)/ /g' <<< "$full"
apple.orange orange.banana apple.banana
Note the -r option (for BSD sed, use -E), which enables extended regular expressions and lets us omit a lot of backslashes.
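A different technique, not the one used in the answer above, is to compare whole whitespace-separated fields instead of using a regex. That way a dot is never treated as a boundary and any regex metacharacters in the word are harmless. A rough sketch using awk; remove_word is a hypothetical name:

```shell
# Drop every field that is exactly equal to the given word and
# re-join the remaining fields with single spaces.
remove_word() {
  awk -v w="$1" '{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
      if ($i != w) { out = out sep $i; sep = " " }
    }
    print out
  }'
}

full="apple.orange orange.banana apple.banana banana"
first="banana"
full=$(printf '%s\n' "$full" | remove_word "$first")
printf '%s\n' "$full"
```

Because the fields are re-joined, runs of blanks are collapsed as a side effect, which also covers two occurrences of the word that sit next to each other.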

Change CSV Delimiter with sed

I've got a CSV file that looks like:
1,3,"3,5",4,"5,5"
Now I want to change all the "," not within quotes to ";" with sed, so it looks like this:
1;3;"3,5";4;"5,5"
But I can't find a pattern that works.
If you are expecting only numbers then the following expression will work
sed -e 's/,/;/g' -e 's/\("[0-9][0-9]*\);\([0-9][0-9]*"\)/\1,\2/g'
e.g.
$ echo '1,3,"3,5",4,"5,5"' | sed -e 's/,/;/g' -e 's/\("[0-9][0-9]*\);\([0-9][0-9]*"\)/\1,\2/g'
1;3;"3,5";4;"5,5"
You can't just replace [0-9][0-9]* with .* to keep every , that is enclosed in quotes: .* is too greedy and matches too much. So you have to use [a-z0-9]* instead:
$ echo '1,3,"3,5",4,"5,5",",6","4,",7,"a,b",c' | sed -e 's/,/;/g' -e 's/\("[a-z0-9]*\);\([a-z0-9]*"\)/\1,\2/g'
1;3;"3,5";4;"5,5";",6";"4,";7;"a,b";c
It also has the advantage over the first solution of being simple to understand. We just replace every , by ; and then correct every ; in quotes back to a ,
You could try something like this:
echo '1,3,"3,5",4,"5,5"' | sed -r 's|("[^"]*),([^"]*")|\1\x1\2|g;s|,|;|g;s|\x1|,|g'
which replaces all commas within quotes with \x1 char, then replaces all commas left with semicolons, and then replaces \x1 chars back to commas. This might work, given the file is correctly formed, there're initially no \x1 chars in it and there're no situations where there is a double quote inside double quotes, like "a\"b".
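For input where the placeholder approach feels fragile, another option is to walk each line character by character and track whether the current position is inside quotes. This is a different technique from the sed answers, sketched in awk, and it still assumes well-formed input without escaped quotes; csv_semicolons is a made-up name:

```shell
# Turn commas into semicolons only while outside double quotes.
csv_semicolons() {
  awk '{
    out = ""; inq = 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "\"") inq = !inq            # toggle the in-quotes state
      else if (c == "," && !inq) c = ";"   # delimiter outside quotes
      out = out c
    }
    print out
  }'
}

printf '%s\n' '1,3,"3,5",4,"5,5"' | csv_semicolons
```

Since the state is tracked per character, a quoted field with several commas in it is left intact as well.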
Using gawk
gawk '{$1=$1}1' FPAT="([^,]+)|(\"[^\"]+\")" OFS=';' filename
Test:
[jaypal:~/Temp] cat filename
1,3,"3,5",4,"5,5"
[jaypal:~/Temp] gawk '{$1=$1}1' FPAT='([^,]+)|(\"[^\"]+\")' OFS=';' filename
1;3;"3,5";4;"5,5"
This might work for you:
echo '1,3,"3,5",4,"5,5"' |
sed 's/\("[^",]*\),\([^"]*"\)/\1\n\2/g;y/,/;/;s/\n/,/g'
1;3;"3,5";4;"5,5"
Here's an alternative solution, which is longer but more flexible:
echo '1,3,"3,5",4,"5,5"' |
sed 's/^/\n/;:a;s/\n\([^,"]\|"[^"]*"\)/\1\n/;ta;s/\n,/;\n/;ta;s/\n//'
1;3;"3,5";4;"5,5"