Extract string between underscores and dot - regex

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo $tmp | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).

With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"

You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with the -E option) - Group 1: any zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.
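As a rough sketch of how the BRE command above behaves, it can be wrapped in a small function and tried on a few paths. The function name tag_via_sed and the extra sample paths are made up for illustration; they are not from the question.

```shell
# Hypothetical helper: prints the part of the file name between the first
# underscore and the last dot, or nothing when the path does not match.
tag_via_sed() {
    printf '%s\n' "$1" | sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p'
}

tag_via_sed '/my/directory/file1_AAA_123_k.txt'   # AAA_123_k
tag_via_sed '/my_dir/file2_KK_45.txt'             # KK_45 (underscore in the directory is skipped)
tag_via_sed 'file3_CCC.txt'                       # prints nothing: the pattern requires a /
```

Because the pattern is anchored on the last /, underscores in directory names do not affect the result, but a bare file name without any / is not matched.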

With your shown samples, with GNU grep you could try following code.
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: GNU grep's -o option prints only the matched text and -P enables PCRE. The regex .*?_\K([^.]*) returns the text between the first _ and the first . after it:
.*?_ ##Lazily match from the start of the line up to the first _.
\K ##Discard everything matched so far, so that only what follows is printed.
([^.]*) ##Match everything up to the first dot.

A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45

If you need to process the file names one at a time (eg, within a while read loop) you can perform two parameter expansions, eg:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45
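The two parameter expansions above can be wrapped in a function. This is only a sketch: the name extract_tag is hypothetical, and it additionally strips the directory first (an assumption that goes beyond the answer) so that underscores in directory names are ignored.

```shell
# Hypothetical helper built only from POSIX parameter expansions.
# Note: plain sh has no "local", so "name" is a global variable here.
extract_tag() {
    name=${1##*/}                 # drop the directory part
    name=${name#*_}               # drop everything up to the first underscore
    printf '%s\n' "${name%%.*}"   # drop everything from the first dot onwards
}

extract_tag '/my/directory/file1_AAA_123_k.txt.2'   # AAA_123_k
extract_tag '/my/directory/file2_KK_45.txt'         # KK_45
```

No external process is started, which matters mostly when this runs inside a loop over many names.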

Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45

This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"

Related

Sed: can not replace part of the string without replacing all of it

I am trying to replace part of the string, but can not find a proper regex for sed to execute it properly.
I have a string
/abc/foo/../bar
And I would like to achieve the following result:
/abc/bar
I have tried to do it using this command:
echo $string | sed 's/\/[^:-]*\..\//\//'
But as result I am getting just /bar.
I understand that I must use a group, but I just do not get it.
Could you please help me find the group that could be used?
You can use
#!/bin/bash
string='/abc/foo/../bar'
sed -nE 's~^(/[^/]*)(/.*)?/\.\.(/[^/]*).*~\1\3~p' <<< "$string"
Details:
-n - suppresses default line output
-E - enables POSIX ERE regex syntax
^ - start of string
(/[^/]*) - Group 1: a / and then zero or more chars other than /
(/.*)? - an optional group 2: a / and then any text
/\.\. - a /.. fixed string
(/[^/]*) - Group 3: a / and then zero or more chars other than /
.* - the rest of the string.
\1\3 replaces the match with Group 1 and 3 values concatenated
p only prints the result of successful substitution.
You can use a capture group for the first part and then match until the last / to remove.
As you are using / to match in the pattern, you can opt for a different delimiter.
#!/bin/bash
string="/abc/foo/../bar"
sed 's~\(/[^/]*/\)[^:-]*/~\1~' <<< "$string"
The pattern in parts:
\( Capture group 1
/[^/]*/ Match from the first till the second / with any char other than / in between
\) Close group 1
[^:-]*/ Match optional chars other than : and - then match /
Output
/abc/bar
Using sed
$ sed 's#^\(/[^/]*\)/.*\(/\)#\1\2#' input_file
/abc/bar
or
$ sed 's#[^/]*/[^/]*/##2' input_file
/abc/bar
Using awk
string='/abc/foo/../bar'
awk -F/ '{print "/"$2"/"$NF}' <<< "$string"
#or
awk -F/ 'BEGIN{OFS=FS}{print $1,$2,$NF}' <<< "$string"
/abc/bar
Using bash
string='/abc/foo/../bar'
echo "${string%%/${string#*/*/}}/${string##*/}"
/abc/bar
Using any sed:
$ echo "$string" | sed 's:\(/[^/]*/\).*/:\1:'
/abc/bar
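The answers above handle the single sample string. If more than one "dir/.." pair may occur, one possible sketch is a sed loop that removes one pair per pass until nothing changes. The function name collapse_dotdot is hypothetical, and the approach is purely textual: it assumes an absolute path, does not resolve symlinks, and leaves a trailing /.. without a following slash untouched.

```shell
# Hypothetical helper: repeatedly delete "/<component>/../" pairs.
# :a defines a label, and "ta" jumps back to it whenever the s command
# made a substitution, so the loop ends once no pair is left.
collapse_dotdot() {
    printf '%s\n' "$1" | sed -e ':a' -e 's#/[^/][^/]*/\.\./#/#' -e 'ta'
}

collapse_dotdot '/abc/foo/../bar'    # /abc/bar
collapse_dotdot '/a/b/c/../../d'     # /a/d
```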

Capture word after pattern with slash

I want to extract word1 from:
something /CLIENT_LOGIN:word1 something else
I would like to extract the first word after matching pattern /CLIENT_LOGIN:.
Without the slash, something like this works:
A='something /CLIENT_LOGIN:word1 something else'
B=$(echo $A | awk '$1 == "CLIENT_LOGIN" { print $2 }' FS=":")
With the slash though, I can't get it working (I tried putting / and \/ in front of CLIENT_LOGIN). I don't mind whether it is done with awk, grep, sed, ...
Using sed:
s='something /CLIENT_LOGIN:word1 something else'
sed -E 's~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~' <<< "$s"
word1
Details:
We use ~ as regex delimiter in sed
/CLIENT_LOGIN:([^[:blank:]]+) matches /CLIENT_LOGIN: followed by one or more non-blank characters (anything other than space or tab), which are captured in group #1
.* on both sides matches text before and after our match
\1 is used in substitution to put 1st group's captured value back in output
1st solution: With your shown samples, please try the following GNU grep solution.
grep -oP '^.*? /CLIENT_LOGIN:\K(\S+)' Input_file
Explanation: GNU grep's -o option prints only the matched text and -P enables PCRE. The regex ^.*? /CLIENT_LOGIN:\K(\S+) lazily matches from the start of the line up to the first occurrence of /CLIENT_LOGIN:, \K then discards everything matched so far, and \S+ matches the run of non-space characters that follows, which is the only part printed.
2nd solution: Using awk's match function along with its split function to print the required value.
awk '
match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){
split(substr($0,RSTART,RLENGTH),arr,":")
print arr[2]
}
' Input_file
3rd solution: Using GNU awk's FPAT variable. FPAT is set to /CLIENT_LOGIN: followed by one or more non-space characters, so that text becomes the first field. The main block uses sub to delete everything up to the : from the first field and then prints it.
awk -v FPAT='/CLIENT_LOGIN:[^[:space:]]+' '{sub(/.*:/,"",$1);print $1}' Input_file
Performing a regex match and capturing the resulting string in BASH_REMATCH[]:
$ regex='.*/CLIENT_LOGIN:([^[:space:]]*).*'
$ A='something /CLIENT_LOGIN:word1 something else'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
word1
Verifying B remains undefined if we don't find our match:
$ A='something without the desired string'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
<<<=== nothing output
Fixing your awk command, you can use
A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"
B=$(echo "$A" | awk 'match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){print substr($0,RSTART+14,RLENGTH-14)}')
This yields xdfmb1d. Details:
\/CLIENT_LOGIN: - a /CLIENT_LOGIN: string
[^[:space:]]+ - one or more non-whitespace chars
The pattern above is what awk searches for, and once matched, the part of this match value after /CLIENT_LOGIN: is "extracted" using substr($0,RSTART+14,RLENGTH-14) (where 14 is the length of the /CLIENT_LOGIN: string).
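For completeness, the same extraction can be sketched without any external tool, using only POSIX parameter expansion. The function name client_login is hypothetical, and the sketch assumes that the value is terminated by a space (or the end of the string) rather than by a tab.

```shell
# Hypothetical helper: prints the word after /CLIENT_LOGIN: and returns
# a non-zero status when the marker is not present at all.
client_login() {
    case $1 in
        *"/CLIENT_LOGIN:"*) ;;      # marker found, carry on
        *) return 1 ;;              # no marker: report failure, print nothing
    esac
    rest=${1#*/CLIENT_LOGIN:}       # drop everything up to the first marker
    printf '%s\n' "${rest%% *}"     # keep only the text before the first space
}

client_login 'something /CLIENT_LOGIN:word1 something else'            # word1
client_login '/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C'   # xdfmb1d
client_login 'no marker here' || echo 'no match'                       # no match
```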

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
A quick demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: With awk this is simple; written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: The field separator is set to / or -, and the 2nd and 3rd fields are printed joined by a -.
2nd solution: Using awk's match function.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched text and -P enables PCRE (Perl-compatible regular expressions). The regex matches .*/, uses \K to discard that part of the match, and then [^-]*-[^-]* matches everything up to, but not including, the 2nd - on the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to and including the first /, discard that part of the match with \K, and then match the character class one or more times on each side of a hyphen, so that at least one character is required before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
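The parameter substitutions above can be condensed into a function. This is a sketch under the stated assumptions (the wanted text sits between the last forward slash and the 2nd hyphen after it). The name first_two_fields is hypothetical, and unlike the step-by-step version it derives the second field from the part after the last slash, so a hyphen in a directory name does not interfere.

```shell
# Hypothetical helper using only POSIX parameter expansions.
# Plain sh has no "local", so base/first/rest are global variables here.
first_two_fields() {
    base=${1##*/}          # text after the last /
    first=${base%%-*}      # up to the first hyphen
    rest=${base#*-}        # after the first hyphen
    printf '%s-%s\n' "$first" "${rest%%-*}"
}

first_two_fields 'home/JOHNSMITH-4991-common-task-list'      # JOHNSMITH-4991
first_two_fields 'home/my-dir/JOHNSMITH-4991-common-task'    # JOHNSMITH-4991
```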

How to extract text between first 2 dashes in the string using sed or grep in shell

I have the string like this feature/test-111-test-test.
I need to extract string till the second dash and change forward slash to dash as well.
I have to do this in a Makefile using shell syntax, and some regular expressions that would help in this case don't work for me there.
In the end I need to get something like this:
input - feature/test-111-test-test
output - feature-test-111- or at least feature-test-111
feature/test-111-test-test | grep -oP '\A(?:[^-]++-??){2}' | sed -e 's/\//-/g')
But grep -oP doesn't work in my case. This regexp doesn't work as well - (.*?-.*?)-.*.
Another sed solution using a capture group and regex/pattern iteration (same thing Socowi used):
$ s='feature/test-111-test-test'
$ sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}"
feature-test-111-
Where:
-E - enable extended regex support
s/\//-/ - replace / with -
s/^ ... .*$/ - anchor the match at the start of the line (^) and consume the rest of the line (.*$)
(([^-]*-){3}) - capture group #1, consisting of 3 repetitions of zero or more characters other than - followed by a -
\1 - print just the capture group #1 (this will discard everything else on the line that's not part of the capture group)
To store the result in a variable:
$ url=$(sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}")
$ echo $url
feature-test-111-
You can use awk keeping in mind that in Makefile the $ char in awk command must be doubled:
url=$(shell echo 'feature/test-111-test-test' | awk -F'-' '{gsub(/\//, "-", $$1);print $$1"-"$$2"-"}')
echo "$url"
# => feature-test-111-
Here, -F'-' sets the field delimiter to -, gsub(/\//, "-", $1) replaces / with - in Field 1, and print $1"-"$2"-" prints Fields 1 and 2, each followed by a hyphen.
Or, with a regex as a field delimiter:
url=$(shell echo 'feature/test-111-test-test' | awk -F'[-/]' '{print $$1"-"$$2"-"$$3"-"}')
echo "$url"
# => feature-test-111-
The -F'[-/]' option sets the field separator to - and /.
The '{print $1"-"$2"-"$3"-"}' part prints the first, second and third value with a separating hyphen.
To get the nth occurrence of a character C you don't need fancy perl regexes. Instead, build a regex of the form "(anything that isn't C, then C) for n times":
grep -Eo '([^-]*-){2}' | tr / -
With sed and cut
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/'
Output
feature-test-111
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/;s/$/-/'
Output
feature-test-111-
You can use the simple regex form "not something, then that something", here [^-]*-, to match all characters other than - up to and including a -. This form is valid in both BRE and ERE.
This works:
echo 'feature/test-111-test-test' | sed -nE 's/^([^/]*)\/([^-]*-[^-]*-).*/\1-\2/p'
feature-test-111-
Another idea using parameter expansions/substitutions:
s='feature/test-111-test-test'
tail="${s//\//-}" # replace '/' with '-'
# split first field from rest of fields ('-' delimited); do this 3x times
head="${tail%%-*}" # pull first field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}" # pull first field; append to previous field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}-" # pull first field; append to previous fields; add trailing '-'
$ echo "${head}"
feature-test-111-
A short sed solution, without extended regular expressions:
sed 's|\(.*\)/\([^-]*-[^-]*\).*|\1-\2|'
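Another possible sketch avoids regular expressions altogether by combining tr and cut. The function name branch_slug is hypothetical, and it produces the variant without the trailing hyphen. It assumes the slash comes before the first hyphen, as in the sample.

```shell
# Hypothetical helper: turn every / into -, then keep the first three
# hyphen-separated fields.
branch_slug() {
    printf '%s\n' "$1" | tr '/' '-' | cut -d- -f1-3
}

branch_slug 'feature/test-111-test-test'   # feature-test-111
```

As noted in the awk answer, inside a Makefile recipe every shell $ has to be written as $$.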

Extract QueryString value using sed

I have the following lines in an apache access log
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229656&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229657&blah
/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229658&blah
and I want to extract the MSISDN value only, so the expected output would be
647930229655
647930229656
647930229657
647930229658
I'm using the following sed command, but I can't get it to stop capturing at &
sed 's/.*MSISDN=\(.*\)/\1/'
sed solution:
sed -E 's/.*&MSISDN=([^&]+).*/\1/' file
& - is key/value pair separator in URL syntax, so you should rely on it
([^&]+) - 1st captured group containing any character sequence except &
\1 - backreference to the 1st captured group
The output:
647930229655
647930229656
647930229657
647930229658
-o : means print only matching string not the whole line.
-P: To enable pcre regex.
\K: means ignore everything on the left. But should be part of actual input string.
\d: means digit, + means one or more digit.
grep -oP 'MSISDN=\K\d+' input
647930229655
647930229656
647930229657
647930229658
The following simple sed may also help.
sed 's/.*MSISDN=//;s/&.*//' Input_file
Explanation:
s/.*MSISDN=//: substitutes everything up to and including MSISDN= with nothing (the replacement between // is empty).
; the semicolon separates it from the next sed command.
s/&.*//: substitutes everything from the first remaining & to the end of the line with nothing.
$ grep -oP '(?<=&MSISDN=)\d+' file
647930229655
647930229656
647930229657
647930229658
-o option is meant to show only matched output
-P option is meant to enable PCRE (Perl Compatible Regex)
(?<=regex) is a positive lookbehind assertion. Lookarounds don't consume any characters while matching, unlike the rest of the regex. Hence the only matched output you get is \d+, which is 1 or more digits.
or using sed:
$ sed -r 's/^.*MSISDN=([0-9]+).*$/\1/' file
647930229655
647930229656
647930229657
647930229658
you can also pipe cut to cut
cut -d '&' -f3 Input_file |cut -d '=' -f2
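The sed answers above can be generalised to any parameter name. The following is only a sketch: the function name query_param is hypothetical, it assumes the key contains no regex metacharacters, and it returns the raw value without URL-decoding. If the key occurs more than once, the last occurrence wins because the leading .* is greedy.

```shell
# Hypothetical helper: query_param KEY URL
# [?&] makes sure KEY starts right after ? or &, so a key that merely
# ends in KEY (for example XMSISDN) is not picked up by mistake.
query_param() {
    printf '%s\n' "$2" | sed -n "s/.*[?&]$1=\([^&]*\).*/\1/p"
}

url='/sms/receiveHLRLookup?Ported=No&Status=Success&MSISDN=647930229655&blah'
query_param MSISDN "$url"   # 647930229655
query_param Ported "$url"   # No
```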