Capture word after pattern with slash - regex

I want to extract word1 from:
something /CLIENT_LOGIN:word1 something else
I would like to extract the first word after matching pattern /CLIENT_LOGIN:.
Without the slash, something like this works:
A='something /CLIENT_LOGIN:word1 something else'
B=$(echo $A | awk '$1 == "CLIENT_LOGIN" { print $2 }' FS=":")
With the slash, though, I can't get it working (I tried putting / and \/ in front of CLIENT_LOGIN). I don't mind whether it's done with awk, grep, sed, ...

Using sed:
s='something /CLIENT_LOGIN:word1 something else'
sed -E 's~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~' <<< "$s"
word1
Details:
We use ~ as regex delimiter in sed
/CLIENT_LOGIN:([^[:blank:]]+) matches /CLIENT_LOGIN: followed by 1+ non-blank characters (anything other than a space or tab), which are captured in group #1
.* on both sides matches text before and after our match
\1 is used in substitution to put 1st group's captured value back in output
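A minimal sketch of the same substitution wrapped in a function, assuming a POSIX-style shell and a sed that accepts -E. The name extract_login is hypothetical and not part of the answer above. It adds -n and the p flag, so a line without /CLIENT_LOGIN: prints nothing instead of being echoed back unchanged.

```shell
#!/bin/bash
# extract_login is a hypothetical helper name used only for this sketch.
# -n together with the p flag prints a line only when the substitution succeeded.
extract_login() {
  printf '%s\n' "$1" | sed -nE 's~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~p'
}

extract_login 'something /CLIENT_LOGIN:word1 something else'   # prints word1
extract_login 'something without the desired string'           # prints nothing
```

Note that the pattern still requires a space in front of /CLIENT_LOGIN:, exactly as in the command above.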

1st solution: With your shown samples, please try following GNU grep solution.
grep -oP '^.*? /CLIENT_LOGIN:\K(\S+)' Input_file
Explanation: GNU grep's -o option prints only the matched part, and -P enables PCRE. The regex ^.*? /CLIENT_LOGIN:\K(\S+) lazily matches from the start of the line up to the first occurrence of /CLIENT_LOGIN:. \K then discards everything matched so far, so that only what follows is printed: \S+, one or more non-whitespace characters, which stops at the next space.
2nd solution: Using awk's match function along with its split function to print the required value.
awk '
match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){
split(substr($0,RSTART,RLENGTH),arr,":")
print arr[2]
}
' Input_file
3rd solution: Using GNU awk's FPAT variable. FPAT is set to /CLIENT_LOGIN: followed by one or more non-space characters, so the first field is the whole match. In the main program, sub removes everything up to and including the : from the first field, and the field is then printed.
awk -v FPAT='/CLIENT_LOGIN:[^[:space:]]+' '{sub(/.*:/,"",$1);print $1}' Input_file

Performing a regex match and capturing the resulting string in BASH_REMATCH[]:
$ regex='.*/CLIENT_LOGIN:([^[:space:]]*).*'
$ A='something /CLIENT_LOGIN:word1 something else'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
word1
Verifying B remains undefined if we don't find our match:
$ A='something without the desired string'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
<<<=== nothing output

Fixing your awk command, you can use
A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"
B=$(echo "$A" | awk 'match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){print substr($0,RSTART+14,RLENGTH-14)}')
See the online demo yielding xdfmb1d. Details:
\/CLIENT_LOGIN: - a /CLIENT_LOGIN: string
[^[:space:]]+ - one or more non-whitespace chars
The pattern above is what awk searches for, and once matched, the part of this match value after /CLIENT_LOGIN: is "extracted" using substr($0,RSTART+14,RLENGTH-14) (where 14 is the length of the /CLIENT_LOGIN: string).
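The hard-coded 14 can be avoided by letting awk compute the prefix length. This is only a sketch: login_value and the awk variable p are hypothetical names, and it assumes the prefix contains no regex metacharacters, since it is concatenated into a dynamic regex.

```shell
#!/bin/bash
# login_value is a hypothetical helper name used only for this sketch.
login_value() {
  printf '%s\n' "$1" | awk -v p='/CLIENT_LOGIN:' '
    # p is a plain string, so the slash needs no escaping in the dynamic regex
    match($0, p "[^[:space:]]+") {
      # skip length(p) characters instead of a hard-coded 14
      print substr($0, RSTART + length(p), RLENGTH - length(p))
    }'
}

login_value '/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C'   # prints xdfmb1d
```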

Related

Extract string between underscores and dot

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo $tmp | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).
With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"
You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
See the online demo. Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with E option) - Group 1: any zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.
With your shown samples, with GNU grep you could try following code.
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: GNU grep's -o option prints only the matched part, and -P enables PCRE. The regex .*?_\K([^.]*) extracts the value between the first _ and the first . that follows it:
.*?_ ##Matching from starting of line to till first occurrence of _ by using lazy match .*?
\K ##\K will forget all previous matched values by regex to make sure only needed values are printed.
([^.]*) ##Matching everything till first occurrence of dot as per need.
A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45
If you need to process the file names one at a time (eg, within a while read loop) you can perform two parameter expansions, eg:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
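The two expansions can be wrapped in a function and run over the sample names from the question. This is a sketch, not the answerer's code: the function name is hypothetical, and the extra ${1##*/} step is an added assumption that underscores in directory names should be ignored.

```shell
#!/bin/bash
# between_underscore_and_dot is a hypothetical name used only for this sketch.
between_underscore_and_dot() {
  tmp=${1##*/}                  # assumption: drop the directory part first
  tmp=${tmp#*_}                 # remove everything up to and including the first underscore
  printf '%s\n' "${tmp%%.*}"    # remove everything from the first dot onward
}

for f in /my/directory/file1_AAA_123_k.txt /my/directory/file2_CCC.txt /my/directory/file2_KK_45.txt; do
  between_underscore_and_dot "$f"
done
```

No sub-process is spawned per name, which is the point of using parameter expansion inside a loop.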
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45
Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45
This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"

find recurring pattern with `sed`

I am using GNU bash 4.3.48
I expected that
echo "23S62M1I19M2D" | sed 's/.*\([0-9]*M\).*/\1/g'
would output 62M19M... But it doesn't.
sed 's/\([0-9]*M\)//g' deletes ALL [0-9]*M and retrieves 23S1I2D. but the group \1 is not working as I thought it would.
sed 's/.*\([0-9]*M\).*/ \1 /g', retrieves M...
What am I doing wrong?
Thank you!
With your shown samples and with awk you could try following program.
echo "23S62M1I19M2D" |
awk '
{
val=""
while(match($0,/[0-9]+M/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
'
Explanation: echo prints the value and sends it as standard input to awk. The awk program uses the match function with the regex /[0-9]+M/ in a loop to find every match in the line, appends each match to val, and prints the collected value once the line has been processed.
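The same match-and-advance loop can be sketched in bash itself with BASH_REMATCH, assuming bash 3.2 or later. The function name collect_m_runs is hypothetical.

```shell
#!/bin/bash
# collect_m_runs is a hypothetical name; this sketch requires bash ([[ =~ ]], local, +=).
collect_m_runs() {
  local rest=$1 out=
  # Each pass finds the leftmost run of digits followed by M in what is left.
  while [[ $rest =~ [0-9]+M ]]; do
    out+=${BASH_REMATCH[0]}
    # Drop everything up to and including the match just found.
    rest=${rest#*"${BASH_REMATCH[0]}"}
  done
  printf '%s\n' "$out"
}

collect_m_runs 23S62M1I19M2D   # prints 62M19M
```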
This might work for you (GNU sed):
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//gp}' file
Surround the match by newlines and then remove non-matching parts.
Alternative, using grep and tr:
grep -o '[0-9]*M' file | tr -d '\n'
N.B. tr removes all newlines, including the last one. To restore the final newline, use:
grep -o '[0-9]*M' file | tr -d '\n' | paste -
The alternate solution will concatenate all results into a single line. To achieve the same result with the first solution use:
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//g;H};${x;s/\n//gp}' file
The problem is that the .* is greedy. Since only M is obligatory, when the engine finds last M, it satisfies the regex, so all string is matched, M is captured and thus kept after replacing with \1 backreference.
That means, you can't easily do this with sed. You can do that with Perl much easier since it supports matching and skipping pattern:
#!/bin/bash
perl -pe 's/\d+M(*SKIP)(*F)|.//g' <<< "23S62M1I19M2D"
See the online demo. The pattern matches
\d+M(*SKIP)(*F) - one or more digits, M, and then the match is omitted and the next match is searched for from the failure position
|. - or matches any char other than a line break char.
Or simply match all occurrences and concatenate them:
perl -lane 'BEGIN{$a="";} while (/\d+M/g) {$a .= $&} END{print $a;}' <<< "23S62M1I19M2D"
All \d+M matches are appended to the $a variable which is printed at the end of processing the string.
Your substitution is probably working, but not substituting what you think it is.
In the substitution s/\(foo...\)/\1/, the \1 matches whatever \(...\) matches and captures, so your substitution is replacing foo... by foo...!
% echo "1234ABC" | sed 's/\([A-Z]\)/-\1-/g'
1234-A--B--C-
So you'll need to match more, but capture only a portion of the match. For example:
echo "23S62M1I19M2D" | sed 's/[0-9]*[A-LN-Z]*\([0-9]*M\)/\1/g'
62M19M2D
In the case of sed 's/.*\([0-9]*M\).*/\1/g' (did that appear in an edit to the question, or did I just miss it?), the .* matches ‘greedily’ – it matches as much as it possibly can, thus including the digits before the M. In the example above, the [A-LN-Z] is required to be at the end of the uncaptured part, so the digits are forced to be matched by the [0-9] inside the capture.
Getting a clear idea of what ‘greedy’ means is really important when writing or interpreting regexps.
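A small sketch that makes the greedy behaviour visible on the sample string. The two function names are hypothetical. The first uses the pattern from the question. The second forces at least one digit inside the group, which makes .* give back exactly one digit and no more.

```shell
#!/bin/bash
# Both function names are hypothetical and exist only for this illustration.

# Pattern from the question: [0-9]* may match nothing, so greedy .* keeps the digits.
greedy_capture() {
  printf '%s\n' "$1" | sed 's/.*\([0-9]*M\).*/\1/'
}

# Requiring one digit forces .* to back off by a single character.
one_digit_capture() {
  printf '%s\n' "$1" | sed 's/.*\([0-9][0-9]*M\).*/\1/'
}

greedy_capture 23S62M1I19M2D      # prints M
one_digit_capture 23S62M1I19M2D   # prints 9M
```

In both cases the whole line is matched, so only the last M survives; neither variant can produce 62M19M on its own.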
If you know you will only encounter the suffixes S, M, I and D, an alternative approach would be explicitly deleting the combinations you don't want:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[SID]//g'
This gives the expected:
62M19M
Update: This variant produces the same output, but rejects all non-numeric, non-M suffixes:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[^0-9M]//g'

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: With awk it will be much easier and we could keep it simple, please try following, written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: Simple explanation would be, setting field separator as / OR - and printing 2nd field - and 3rd field of current line.
2nd solution: Using match function of awk program here.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched part and -P enables PCRE (Perl-compatible regular expressions). The regex matches .*/, then \K discards what has been matched so far, and [^-]*-[^-]* matches the text up to, but not including, the 2nd occurrence of -.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to the first occurrence of /, discard what has been matched so far with \K, and then repeat the character class one or more times on each side of a hyphen, so that at least one character is required before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using gnu grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
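As a rough sketch of the last note, a single awk call can handle a whole stream of strings under the same assumptions as above (output is taken from between the last forward slash and the 2nd hyphen). The function name first_two_parts is hypothetical, and the second input line is made-up sample data.

```shell
#!/bin/bash
# first_two_parts is a hypothetical name; it reads one path per line on stdin.
first_two_parts() {
  awk '{
    n = split($0, path, "/")     # path[n] is the part after the last slash
    split(path[n], part, "-")    # part[1] and part[2] sit on either side of the first hyphen
    print part[1] "-" part[2]
  }'
}

# The second line is invented purely to show more than one input string.
printf '%s\n' home/JOHNSMITH-4991-common-task-list home/dir/JANEDOE-17-other-task | first_two_parts
```

This prints JOHNSMITH-4991 and JANEDOE-17, one per line, with one awk process for the whole list.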

linux extract only a string starts with a special string and ends with the first occurrence of comma

I have a log file contains some information like below
"variable1=XXX, emotionType=sad, sentimentType=negative..."
What I want is to grep only the matched string, the string starts with emotionType and ends with the first occurrence of comma.
E.g.
emotionType=sad
emotionType=joy
...
What I have tried is
grep -e "/^emotionType.*,/" file.log -o
but I got nothing. Anyone can tell me what should I do?
You need to use
grep -o "emotionType[^,]*" file.log
Note:
Remove ^ if your matches are not located at the beginning of each line, or replace it with \<, the start-of-word boundary construct
Remove the / chars on both ends of the regex since grep does not use regex delimiters (like sed)
[^,] is a negated bracket expression that matches any char other than a comma
* is a POSIX BRE quantifier that matches zero or more occurrences.
See an online demo:
#!/bin/bash
s="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
grep -o "emotionType=[^,]*" <<< "$s"
Output:
emotionType=sad
emotionType=happy
1st solution: With awk you could try following program. Simple explanation would be using awk's match function capability and using regex to match string emotionType till next occurrence of , and printing all the matches in awk program.
var="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
Where var is a shell variable.
echo "$var" |
awk '{while(match($0,/emotionType=[^,]*/)){print substr($0,RSTART,RLENGTH);$0=substr($0,RSTART+RLENGTH)}}'
2nd solution: Or in GNU awk using RS variable try following awk program.
echo "$var" | awk -v RS='emotionType=[^,]*' 'RT{sub(/\n+$/,"",RT);print RT}'

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that all "lopy" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?
The (.*sym.*?)lopy / (.*sym.+?)lopy patterns are almost the same, .+? matches one or more chars other than line break chars, but as few as possible, and .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers, *? is the same as * in sed. However, the main problem with the regexps you used is that they match sym, then any text after it and then lopy, so when you added g, it just means you want to find more cases of lopy after sym....lopy. And there is only one such occurrence in your string.
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).
Since the OP mentioned sed, I am adding an awk program here, which could be a better choice than sed. With the shown samples, please try the following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: Adding detailed explanation for above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Printing values and sending as standard output to awk program as an input.
awk -F',sym,' ' ##Making ,sym, as a field separator here.
{
first=$1 ##Creating first which has $1 of current line in it.
$1="" ##Nullifying $1 here.
sub(/^[[:space:]]+/,"") ##Substituting initial space in current line here.
gsub(/lop/,"lad") ##Globally substituting lop with lad in rest of line.
$0=first FS $0 ##Adding first FS to rest of edited line here.
}
1 ##Printing edited/non-edited line value here.
'
The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between syms and in the replacement side run code, in which a regex replaces all lopys
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between syms I use \K after the first sym, which drops matches prior to it, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r since $1 can't change, and we want the regex to return anyway. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).
sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (@x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
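The split-then-replace idea described above can also be sketched in plain shell with parameter expansion, assuming the separator is always the literal text "sym,". The function name replace_after_first_sym is hypothetical.

```shell
#!/bin/bash
# replace_after_first_sym is a hypothetical name used only for this sketch.
replace_after_first_sym() {
  case $1 in
    *sym,*)
      head=${1%%sym,*}    # text before the first "sym,"
      tail=${1#*sym,}     # text after the first "sym,"
      # Only the tail is edited; the head is put back untouched.
      printf '%s\n' "${head}sym,$(printf '%s' "$tail" | sed 's/lopy/lady/g')"
      ;;
    *)
      printf '%s\n' "$1"  # no "sym," at all: leave the line as it is
      ;;
  esac
}

replace_after_first_sym 'lopy,lopy1,sym,lopy,lopy1,sym'   # prints lopy,lopy1,sym,lady,lady1,sym
```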
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/g'
The variable-length lookbehind generates a warning and is only supported in Perl 5.30+ (you can turn the warning off with no warnings qw(experimental::vlb);).
Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym
Explanation:
:a sets a label a before matching sym,.* pattern
ta jumps pattern matching back to label a after making a substitution
The loop stops when the s command has nothing left to match, i.e. when there is no lopy substring after sym,
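To watch the loop work, one possible sketch adds -n and the p flag so that the pattern space is printed after every successful substitution. The function name trace_lady_loop is hypothetical. Because .* is greedy, the last lopy after sym, is replaced first.

```shell
#!/bin/bash
# trace_lady_loop is a hypothetical name; it reads lines on stdin.
# -n suppresses the default output, and the p flag prints after each successful s command.
trace_lady_loop() {
  sed -nE -e ':a' -e 's~(sym,.*)lopy~\1lady~p' -e 'ta'
}

printf '%s\n' 'lopy,lopy1,sym,lopy,lopy1,sym' | trace_lady_loop
# lopy,lopy1,sym,lopy,lady1,sym
# lopy,lopy1,sym,lady,lady1,sym
```

The second printed line is the final result; a third pass finds no lopy after sym, and the loop ends.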