Regex EOL replace with perl is giving unexpected results

Why is there a dollar sign at the start of lines 2 and 3?
➜ echo -e "hello\nworld" | perl -pe 's/$/\$/g'
hello$
$world$
$%
Above, I am trying to add a dollar sign at the end of each line, but somehow a dollar sign also appears at the beginning of the following line. This happens when the global flag is enabled. When I remove the global flag, it works fine:
➜ echo -e "hello\nworld" | perl -pe 's/$/\$/'
hello$
world$
Can anyone explain what's happening? Maybe it has something to do with '\r\n' characters?
EDIT : Adding the lookbehind case
It's not just breaking in this case, but in other cases as well. Consider the following:
➜ echo -e "A\nB\nC\nD" | perl -pe 's/(?<!A)$/\$/'
A
$B$
C$
D$
Above, I want to mark rows which don't end in "A" with $.
The extra dollar sign at the start of line 2 shouldn't be there. I'm not even using the global flag.
SOLUTION: Okay, got it now. The solution for the second one is like this (for the explanation, refer to Wiktor Stribiżew's answer):
➜ echo -e "A\nB\nC\nD" | perl -pe 's/(?<!A|\n)$/\$/'
A
B$
C$
D$
But beware: if the alternatives inside the lookbehind have different lengths (for example a two-character string next to \n), Perl throws "Variable length lookbehind not implemented in regex". For example:
➜ echo -e "AA\nBB\nCC\nDD" | perl -pe 's/(?<!AA|\n)$/\$/'
Variable length lookbehind not implemented in regex m/(?<!AA|\n)$/ at -e line 1.
To solve this, pad the \n alternative with the appropriate number of . so that both alternatives have the same length:
➜ echo -e "AA\nBB\nCC\nDD" | perl -pe 's/(?<!AA|.\n)$/\$/'
AA
BB$
CC$
DD$

The point is that $ is a zero-width assertion that matches both at the very end of the string and just before a final newline. With -p, Perl reads each line including its trailing \n, so with the g flag $ matches twice per line: once before the \n and once after it.
Your string basically goes to Perl as two lines:
hello\n
world\n
Since $ matches both before the final newline and at the very end of the string, each of these lines (each one is a separate string as far as the regex is concerned) yields two matches.
If you want to match the very end of string, use \z:
perl -pe 's/\z/\$/g'
\z matches only at the very end of the string, that is, after the trailing \n. That is rarely what you want here: the $ is inserted after each newline, so it shows up at the start of the second and subsequent lines and once more as a final line of its own.
To insert $ only before the trailing \n and stop, use your original perl -pe 's/$/\$/', without the g modifier.

If you really want to use it with the global replace, you can use the following command ($1 is the idiomatic Perl form of the back-reference in the replacement):
echo -e "hello\nworld" | perl -pe 's/^(.*)$/$1\$/g'
hello$
world$
or without back-references you can use:
echo -e "hello\nworld" | perl -pe 's/\n$/\$\n/g'
hello$
world$
You might need to replace \n with \r\n if the file comes from Windows, or just use dos2unix to remove the Windows \r characters first.

Related

Extract string between underscores and dot

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo "$tmp" | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).

With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"

You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with the -E option) - Group 1: any zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.

With your shown samples, you could try the following with GNU grep.
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: GNU grep's -o option prints only the matched text and -P enables PCRE. The regex .*?_\K([^.]*) extracts the value between the first _ and the first . after it:
.*?_ ##Lazily match from the start of the line up to the first _.
\K ##Discard everything matched so far, so that only what follows is printed.
([^.]*) ##Match everything up to the first dot.

A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45

If you need to process the file names one at a time (eg, within a while read loop) you can perform two parameter expansions, eg:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45
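The two parameter expansions can be wrapped in a small function and applied in a while read loop. This is a sketch only: the function name is hypothetical, and the extra ${1##*/} step (stripping the directory first) is an added assumption so that underscores or dots in directory names do not interfere.

```shell
# Hypothetical helper: prints the text between the first "_" of the file name
# and the first "." after it.
between_underscore_and_dot() {
  base=${1##*/}                 # drop the directory part (added assumption)
  rest=${base#*_}               # drop everything up to and including the first "_"
  printf '%s\n' "${rest%%.*}"   # drop everything from the first "." onwards
}

while IFS= read -r path; do
  between_underscore_and_dot "$path"
done <<'EOF'
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
EOF
```

No sub-process is spawned per file name, which is the main attraction of this approach inside a loop.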

Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45

This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?

You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
Demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991

1st solution: With awk this is much easier and we can keep it simple. Please try the following, written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: Set the field separator to / or - and print the 2nd field, a -, and the 3rd field of the current line.
2nd solution: Using awk's match function.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched text and the -P option enables PCRE (Perl-compatible regular expressions). The regex uses .*/ followed by \K to discard the part matched so far, then [^-]*-[^-]* to capture everything up to the 2nd occurrence of - in the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'

Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991

You could match up to the first occurrence of /, then clear the match buffer with \K, and then repeat the character class 1+ times on each side of a hyphen so that at least one character is required before and after it.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If GNU awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991

Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
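The last note (streaming all strings through a single awk call) could look roughly like the following. This is a sketch under the same assumptions as above; the function name and the second sample path are made up for illustration.

```shell
# Hypothetical helper: reads paths on stdin and prints "<first>-<second>", where
# the two parts are the first two hyphen-separated fields after the last "/".
first_two_hyphen_fields() {
  awk '{
    sub(/.*\//, "")            # strip everything up to the last "/"
    n = split($0, parts, "-")  # split the remainder on hyphens
    if (n >= 2) print parts[1] "-" parts[2]
  }'
}

printf '%s\n' \
  'home/JOHNSMITH-4991-common-task-list' \
  'home/dir/JANEDOE-17-other-task' |
  first_two_hyphen_fields
```

Only one awk process is started no matter how many paths are fed in, whereas the parameter-expansion version pays no process cost per string but runs inside a comparatively slow shell loop.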

Get substring using either perl or sed

I can't seem to get a substring correctly.
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g')
That still returns bugfix/US3280841-something-duh.
If I try to use perl instead:
declare BRANCH_NAME="bugfix/US3280841-something-duh";
# Trim it down to "US3280841"
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9]|[A-Z0-9])+/; print $1');
That outputs nothing.
What am I doing wrong?

Using bash parameter expansion only:
$: # don't use caps; see below.
$: declare branch="bugfix/US3280841-something-duh"
$: tmp="${branch##*/}"
$: echo "$tmp"
US3280841-something-duh
$: trimmed="${tmp%%-*}"
$: echo "$trimmed"
US3280841
Which means:
$: tmp="${branch##*/}"
$: trimmed="${tmp%%-*}"
does the job in two steps without spawning extra processes.
In sed,
$: sed -E 's#^.*/([^/-]+)-.*$#\1#' <<< "$branch"
This says "after any or no characters followed by a slash, remember one or more that are not slashes or dashes, followed by a not-remembered dash and then any or no characters, then replace the whole input with the remembered part."
Your original pattern was
's/\(^.*\)\/[a-z0-9]\|[A-Z0-9]\+/\1/g'
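Read as a POSIX basic regular expression, where \| and \+ may not be operators (for example in some non-GNU seds without -E), this says "remember any number of anything followed by a slash, then a lowercase letter or a digit, then a pipe character (because those only work with -E), then a capital letter or digit, then a literal plus sign, and then replace it all with what you remembered." GNU sed instead treats \| as alternation and \+ as "one or more", so there the pattern means "either anything up to a slash followed by a lowercase letter or digit, or a run of capitals and digits". Neither reading extracts the ID.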
GNU's manual is your friend. I look stuff up all the time to make sure I'm doing it right. Sometimes it still takes me a few tries, lol.
An aside - try not to use all-capital variable names. By convention those are used for environment variables and shell-internal variables such as RANDOM or IFS, so your own names can collide with them.
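To make the BRE/ERE difference concrete, here is the same extraction written both ways. It is a sketch that sticks to syntax which should behave the same in GNU and BSD sed; the lowercase variable names are just for the example.

```shell
branch='bugfix/US3280841-something-duh'

# POSIX BRE: groups are \( \), and "one or more" is spelled X X* because
# + is not a BRE operator.
bre=$(printf '%s\n' "$branch" | sed 's#^.*/\([^/-][^/-]*\)-.*$#\1#')

# POSIX ERE (-E): groups are ( ), and + means "one or more".
ere=$(printf '%s\n' "$branch" | sed -E 's#^.*/([^/-]+)-.*$#\1#')

printf 'BRE: %s\nERE: %s\n' "$bre" "$ere"
```

Both forms print US3280841 for this input.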

You may use this sed:
sed -E 's~^.*/|-.*$~~g' <<< "$BRANCH_NAME"
US3280841
Or this awk:
awk -F '[/-]' '{print $2}' <<< "$BRANCH_NAME"
US3280841

sed 's:[^/]*/\([^-]*\)-.*:\1:' <<< "bugfix/US3280841-something-duh"

The Perl version just has the + in the wrong place. It should be inside the capturing parentheses:
TRIMMED=$(echo $BRANCH_NAME | perl -nle 'm/^.*\/([a-z0-9A-Z]+)/; print $1');

In your sed case, just put a ^ before A-Z0-9 (this relies on GNU sed's \| and \+):
TRIMMED=$(echo $BRANCH_NAME | sed -e 's/\(^.*\)\/[a-z0-9]\|[^A-Z0-9]\+/\1/g')
Alternatively, and more briefly, you can delete every lowercase letter, slash and hyphen:
TRIMMED=$(echo $BRANCH_NAME | sed "s/[a-z\/\-]//g" )

Type this in a shell terminal:
$ BRANCH_NAME="bugfix/US3280841-something-duh"
$ echo $BRANCH_NAME| perl -pe 's/.*\/(\w\w[0-9]+).+/\1/'
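Use the s (substitute) command instead of m (match).
Perl accepts \1 in the replacement for sed compatibility (although $1 is the idiomatic form), so the same command should also work with GNU sed, using 'sed -E' instead of 'perl -pe'. Note that \w is a GNU extension in sed.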

Another variant using Perl Regular Expression Character Classes (see perldoc perlrecharclass).
echo $BRANCH_NAME | perl -nE 'say m/^.*\/([[:alnum:]]+)/;'

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and I want to extract the MSISDN value only, so the expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command but I can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'

sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
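The difference between the greedy capture from the question and the bounded one can be seen on a single sample line. This is just a sketch using the first log line from the question; the variable names are made up.

```shell
line='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah'

# Greedy capture from the question: .* runs to the end of the line.
greedy=$(printf '%s\n' "$line" | sed 's/.*MSISDN=\(.*\)/\1/')

# Bounded capture: [^&]+ stops at the next key/value separator.
bounded=$(printf '%s\n' "$line" | sed -E 's/.*&MSISDN=([^&]+).*/\1/')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The greedy version keeps the trailing &blah, while the bounded one yields only the number.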

-o: print only the matching part, not the whole line.
-P: enable PCRE regex.
\K: drop everything matched to its left from the reported match; that text must still be present in the input.
\d: a digit; + means one or more of them.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658

The following simple sed may help here.
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with the empty string.
;: the semicolon separates this from the next sed command.
s/&.*//: substitutes everything from the first remaining & to the end of the line with the empty string.

$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
The -o option shows only the matched part of the line.
The -P option enables PCRE (Perl Compatible Regular Expressions).
(?<=regex) is a positive lookbehind assertion. Lookarounds don't consume any characters while matching, unlike normal regex elements. Hence the only matched output you get is \d+, which is 1 or more digits.
Or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658

You can also pipe cut into cut (this relies on MSISDN being the third &-separated field):
cut -d '&' -f3 Input_file |cut -d '=' -f2

having a regex replacing across lines, retain the newlines?

I'd like to have a substitute or print style command with a regex working across lines. And lines retained.
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | tr -d '\n' | grep -oE 'b.*f'
bcdef
or
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | tr -d '\n' | sed -r 's|b(.*)f|y\1z|'
aycdezg
I'd like to use grep or sed because I'd like to know what people would have done before awk or perl.
Would they not have done this? Was .* not available? Did they have no other equivalent?
The goal is to modify some input with a regex that spans lines, and print it to stdout or write it to a file, retaining the line breaks.

This should do what you're looking for:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;$s/b\([^f]*\)f/y\1z/;N;ba'
a
y
c
d
e
z
g
It accumulates all the lines then does the replacement. It looks for the first "f". If you want it to look for the last "f", change [^f] to ..
Note that this may make use of features added to sed after AWK or Perl became available (AWK has been around a looong time).
Edit:
To do a multi-line grep requires only a little modification:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;$s/^[^b]*\(b[^f]*f\)[^f]*$/\1/;N;ba'
b
c
d
e
f
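The commands above rely on GNU sed printing the pattern space when N runs out of input. A variant of the same accumulate-then-substitute idea that does not depend on that behaviour might look like this; it is a sketch, with each command passed as its own -e fragment so that the label ends cleanly in non-GNU seds too.

```shell
# Accumulate every line into the pattern space, then substitute once.
# ':a' defines a label, 'N' appends the next line, and '$!ba' loops back
# to the label unless the last line has been read.
result=$(printf 'a\nb\nc\nd\ne\nf\ng\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/b\([^f]*\)f/y\1z/')
printf '%s\n' "$result"
```

The newlines stay in place; only the b and the first f after it are replaced, by y and z respectively.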

sed can match across newlines through the use of its N command. For example, the following sed command will replace bar followed by a newline followed by baz with ###:
$ echo -e "foo\nbar\nbaz\nqux" | sed 'N;s/bar\nbaz/###/;P;D'
foo
###
qux
The N command will append the next input line to the current pattern space separated by an embedded newline (\n)
The P command will print the current pattern space up to and including the first embedded newline.
The D command will delete up to and including the first embedded newline in the pattern space. It will also start next cycle but skip reading from the input if there is still data in the pattern space.
Through the use of these 3 commands, you can essentially do any sort of s command replacement looking across N-lines.
Edit
If your question is how can I remove the need for tr in the two examples above and just use sed then here you go:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;N;$!ba;s/\n//g;y/ag/yz/'
ybcdefz

Proven tools to the rescue.
echo -e "foo\nbar\nbaz\nqux" | perl -lpe 'BEGIN{$/=""}s/foo\nbar/###/'
