Linux extract text between specific strings - regex

I have multiple files with different job names. The job name is specified as follows.
#SBATCH --job-name=01_job1 #Set the job name
I want to use sed/awk/grep to automatically get the name, that is to say, what follows '--job-name=' and precedes the comment '#Set the job name'. For the example above, I want to get 01_job1. The job name could be longer for several files, and there are multiple = signs in following lines in the file.
I have tried using grep -oP "job-name=\s+\K\w+" file and get an empty output. I suspect that this doesn't work because there is no space between 'name=' and '01_job1', so they must be understood as a single word.
I also unsuccessfully tried using awk '{for (I=1;I<NF;I++) if ($I == "name=") print $(I+1)}' file, attempting to find the characters after 'name='.
Lastly, I also unsuccessfully tried sed -e 's/name=\(.*\)#Set/\1/' file to find the characters between 'name=' and the beginning of the comment '#Set'. I receive the whole file as my output when I attempt this.
I appreciate any guidance. Thank you!!

You need to match the whole string with sed and capture just what you need to get, and use -n option with the p flag:
sed -n 's/.*name=\([^[:space:]]*\).*/\1/p'
See the online demo:
#!/bin/bash
s='#SBATCH --job-name=01_job1 #Set the job name'
sed -n 's/.*name=\([^[:space:]]*\).*/\1/p' <<< "$s"
# => 01_job1
Details:
-n - suppresses default line output
.* - any text
name= - a literal name= string
\([^[:space:]]*\) - Group 1 (\1): any zero or more chars other than whitespace
.* - any text
p - print the result of the successful substitution.
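The question mentions multiple files. As a minimal sketch (the helper name job_name and the temporary sample file are made up for illustration), the same sed expression can be wrapped in a function that returns only the first match per file:

```shell
# Hypothetical helper: print the job name found on the first
# "#SBATCH ... --job-name=" line of the file given as $1.
job_name() {
    sed -n 's/^#SBATCH.*--job-name=\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Build a small sample job script to try it on.
tmp=$(mktemp)
printf '%s\n' \
    '#!/bin/bash' \
    '#SBATCH --job-name=01_job1 #Set the job name' \
    '#SBATCH --ntasks=4 #Set the task count' > "$tmp"

name=$(job_name "$tmp")
echo "$name"
rm -f "$tmp"
```

Calling job_name inside a loop over the job scripts would cover several files; how those files are named is left to the reader.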

Similar to the answer of Gilles Quenot
grep -oP -- '--job-name=\K.*?(?= *# *Set the job name)'
This adds a look-ahead to ensure that the string is followed by #Set the job name. The lazy .*? keeps the whitespace before the # out of the output.

1st solution: In GNU awk with your shown samples please try following awk code.
awk -v RS=' --job-name=\\S+' 'RT && split(RT,arr,"="){print arr[2]}' Input_file
OR a non-one liner form of above GNU awk code would be:
awk -v RS=' --job-name=\\S+' '
RT && split(RT,arr,"="){
print arr[2]
}
' Input_file
2nd solution: Using any awk please try following code.
awk -F'[[:space:]]+|--job-name=' '{print $3}' Input_file
3rd solution: Using GNU grep please try following code with your shown samples and using non-greedy .*? approach here in regex.
grep -oP '^.*?--job-name=\K\S+' Input_file

Use this. You were close; this is just a corrected version of your grep -oP attempt (the main issue is that you were trying to match whitespace after the = character, and there is none):
$ grep -oP -- '--job-name=\K\S+' file
01_job1
The regular expression matches as follows:
job-name= - the literal string 'job-name='
\K - resets the start of the match (what is Kept), as a shorter alternative to using a look-behind assertion: perlmonks look arounds and Support of K in regex
\S+ - non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
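Where grep has no -P support (BSD/macOS grep, for example), a rough equivalent is to let grep -o print the whole --job-name=... token and then strip the prefix with cut. This is only a sketch, and it assumes the job name contains no whitespace:

```shell
line='#SBATCH --job-name=01_job1 #Set the job name'

# grep -o prints "--job-name=01_job1"; -e keeps the leading "--" from being
# read as an option. cut then keeps everything after the first "=".
name=$(printf '%s\n' "$line" | grep -o -e '--job-name=[^[:space:]]*' | cut -d= -f2-)
echo "$name"
```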

You can use a lookbehind and lookahead with GNU grep to get exactly what you describe:
grep -oP '(?<=--job-name=)\S+(?=\s+#Set the job name)' file
Or with awk:
awk '/^#SBATCH[[:space:]]+--job-name=/ &&
/#Set the job name$/ {
sub(/^[^=]*=/,"")
sub(/[[:space:]]*#[^#]*$/,"")
print
}' file
Or perl:
perl -lnE 'say $1 if /(?<=--job-name=)(\S+)(?=\s+#Set the job name)/' file
Any of these prints:
01_job1

Related

Capture word after pattern with slash

I want to extract word1 from:
something /CLIENT_LOGIN:word1 something else
I would like to extract the first word after matching pattern /CLIENT_LOGIN:.
Without the slash, something like this works:
A='something /CLIENT_LOGIN:word1 something else'
B=$(echo $A | awk '$1 == "CLIENT_LOGIN" { print $2 }' FS=":")
With the slash though, I can't get it working (I tried putting / and \/ in front of CLIENT_LOGIN). I don't care getting it done with awk, grep, sed, ...
Using sed:
s='something /CLIENT_LOGIN:word1 something else'
sed -E 's~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~' <<< "$s"
word1
Details:
We use ~ as regex delimiter in sed
/CLIENT_LOGIN:([^[:blank:]]+) matches /CLIENT_LOGIN: followed by one or more non-blank characters (anything but space or tab), which are captured in group #1
.* on both sides matches text before and after our match
\1 is used in substitution to put 1st group's captured value back in output
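Without -n and the p flag, sed prints lines that do not match unchanged. A small sketch of the stricter variant, using the same expression (the variable names re, hit and miss are illustrative):

```shell
# Same substitution as above, plus the p flag so that only lines where the
# substitution succeeded are printed (sed -n suppresses everything else).
re='s~.* /CLIENT_LOGIN:([^[:blank:]]+).*~\1~p'

hit=$(printf '%s\n' 'something /CLIENT_LOGIN:word1 something else' | sed -nE "$re")
miss=$(printf '%s\n' 'something without the desired string' | sed -nE "$re")

printf 'hit=[%s] miss=[%s]\n' "$hit" "$miss"
```

With this variant, miss stays empty instead of holding the whole input line.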
1st solution: With your shown samples, please try following GNU grep solution.
grep -oP '^.*? /CLIENT_LOGIN:\K(\S+)' Input_file
Explanation: GNU grep's -o option prints only the matched part of a line, and -P enables PCRE regex. The regex ^.*? /CLIENT_LOGIN:\K(\S+) uses a lazy match from the start of the line up to /CLIENT_LOGIN:, so that the very first occurrence of that string is matched. \K then discards everything matched so far, so that only the required value is printed. That value is matched by \S+, which means one or more non-space characters (everything up to the next space).
2nd solution: Using awk's match function along with its split function to print the required value.
awk '
match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){
split(substr($0,RSTART,RLENGTH),arr,":")
print arr[2]
}
' Input_file
3rd solution: Using GNU awk's FPAT variable. FPAT is set to /CLIENT_LOGIN: followed by one or more non-space characters, so the first field is that whole token. In the main program, sub replaces everything up to and including the : in the first field with an empty string, and the first field is then printed.
awk -v FPAT='/CLIENT_LOGIN:[^[:space:]]+' '{sub(/.*:/,"",$1);print $1}' Input_file
Performing a regex match and capturing the resulting string in BASH_REMATCH[]:
$ regex='.*/CLIENT_LOGIN:([^[:space:]]*).*'
$ A='something /CLIENT_LOGIN:word1 something else'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
word1
Verifying B remains undefined if we don't find our match:
$ A='something without the desired string'
$ unset B
$ [[ "${A}" =~ $regex ]] && B="${BASH_REMATCH[1]}"
$ echo "${B}"
<<<=== nothing output
Fixing your awk command, you can use
A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"
B=$(echo "$A" | awk 'match($0,/\/CLIENT_LOGIN:[^[:space:]]+/){print substr($0,RSTART+14,RLENGTH-14)}')
See the online demo yielding xdfmb1d. Details:
\/CLIENT_LOGIN: - a /CLIENT_LOGIN: string
[^[:space:]]+ - one or more non-whitespace chars
The pattern above is what awk searches for, and once matched, the part of this match value after /CLIENT_LOGIN: is "extracted" using substr($0,RSTART+14,RLENGTH-14) (where 14 is the length of the /CLIENT_LOGIN: string).
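If hard-coding 14 feels brittle, one alternative sketch passes the prefix in as an awk variable (key is an illustrative name) and lets length() compute the offset. It uses index(), so the prefix is treated as a literal string rather than a regex:

```shell
A="/CLIENT_IPADDR:23.4.28.2 /CLIENT_LOGIN:xdfmb1d /MXJ_C"

B=$(printf '%s\n' "$A" | awk -v key='/CLIENT_LOGIN:' '
    {
        start = index($0, key)                  # 0 when key is absent
        if (start == 0) next
        rest = substr($0, start + length(key))  # text right after the key
        sub(/[[:space:]].*/, "", rest)          # cut at the first whitespace
        print rest
    }')
echo "$B"
```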

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
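Because .* is greedy, .*/ runs up to the last slash in the line. A quick check of that behaviour, using made-up paths that have extra directories and a hyphen in a directory name:

```shell
# Same sed expression as above, applied to two illustrative paths.
out1=$(printf '%s\n' 'home/dir/JOHNSMITH-4991-common-task-list' |
       sed 's,.*/\([^-]*-[^-]*\).*,\1,')
out2=$(printf '%s\n' 'my-home/dir/JOHNSMITH-4991-common-task-list' |
       sed 's,.*/\([^-]*-[^-]*\).*,\1,')

echo "$out1"
echo "$out2"
```

Both calls should print JOHNSMITH-4991, since only the text after the last slash reaches the capture group.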
1st solution: With awk it will be much easier and we could keep it simple, please try following, written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: The field separator is set to / OR -, and the 2nd field, a literal -, and the 3rd field of the current line are printed.
2nd solution: Using match function of awk program here.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched text, and the -P option enables PCRE (Perl-compatible regular expressions). The regex uses .*/ followed by \K to discard the previously matched part, and then [^-]*-[^-]* to match the text up to, but not including, the 2nd occurrence of - in the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to and including the first occurrence of /, then clear the match buffer with \K, and then match the character class one or more times on each side of a single hyphen, so that at least one character is required both before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using gnu grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
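To illustrate the last note, here is a minimal sketch of a single awk call that handles many strings at once. The three sample paths are made up, and the logic follows the assumptions listed above (output is the text between the last slash and the 2nd hyphen):

```shell
# Three sample paths, one per line (illustrative data only).
paths='home/JOHNSMITH-4991-common-task-list
home/dir/JANEDOE-17-other-task
srv/x/ABC-1-y-z'

result=$(printf '%s\n' "$paths" | awk '{
    sub(/.*\//, "")             # drop everything through the last slash
    n = split($0, part, "-")    # split the remainder on hyphens
    if (n >= 2) print part[1] "-" part[2]
}')
echo "$result"
```

One awk process reads all lines here, which avoids starting a new process per string.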

linux extract only a string starts with a special string and ends with the first occurrence of comma

I have a log file contains some information like below
"variable1=XXX, emotionType=sad, sentimentType=negative..."
What I want is to grep only the matched string, the string starts with emotionType and ends with the first occurrence of comma.
E.g.
emotionType=sad
emotionType=joy
...
What I have tried is
grep -e "/^emotionType.*,/" file.log -o
but I got nothing. Anyone can tell me what should I do?
You need to use
grep -o "emotionType[^,]*" file.log
Note:
Remove ^, or replace it with \< (the start-of-word boundary construct), if your matches are not located at the beginning of each line
Remove the / chars on both ends of the regex since grep does not use regex delimiters (like sed)
[^,] is a negated bracket expression that matches any char other than a comma
* is a POSIX BRE quantifier that matches zero or more occurrences.
See an online demo:
#!/bin/bash
s="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
grep -o "emotionType=[^,]*" <<< "$s"
Output:
emotionType=sad
emotionType=happy
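If only the values are wanted, a possible follow-up is to pipe the matches through cut. This is a sketch that assumes the values themselves contain no = sign:

```shell
s="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"

# grep -o prints one "emotionType=..." match per line;
# cut keeps the part after the "=".
values=$(printf '%s\n' "$s" | grep -o 'emotionType=[^,]*' | cut -d= -f2)
echo "$values"
```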
1st solution: With awk you could try following program. Simple explanation would be using awk's match function capability and using regex to match string emotionType till next occurrence of , and printing all the matches in awk program.
var="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
Where var is a shell variable.
echo "$var" |
awk '{while(match($0,/emotionType=[^,]*/)){print substr($0,RSTART,RLENGTH);$0=substr($0,RSTART+RLENGTH)}}'
2nd solution: Or in GNU awk using RS variable try following awk program.
echo "$var" | awk -v RS='emotionType=[^,]*' 'RT{sub(/\n+$/,"",RT);print RT}'

Sed version extract

I am trying to extract the version number from a string. I am unable to find the exact regex to find what I need.
For eg -
1012-EPS-Test-OF-Something-1.3
I need sed to only extract 1.3 from the above line.
I have tried quite a few things until now something like but it is clearly not working out
sed 's/[^0-9.0-9]*//')
1st solution: With your shown samples, the easiest way is to pass the value of the shell variable to awk as input, set the field separator to -, and print the last field.
echo "$var" | awk -F'-' '{print $NF}'
2nd solution: In case you could have anything else also apart from version number in last field of your value(where - is field delimiter) then use match function of awk.
echo "$var" |
awk -F'-' 'match($NF,/[0-9]+(\.[0-9]+)*/){print substr($NF,RSTART,RLENGTH)}'
3rd solution: Using GNU grep with the \K escape. The regex matches everything up to the last - (the .* is greedy); \K then discards that part of the match, so only the text matched by the rest of the regex is printed.
echo "$var" | grep -oP '.*-\K\d+(\.\d+)*'
This should work in any grep that supports the -o option (GNU and BSD grep both do):
s='1012-EPS-Test-OF-Something-1.3'
grep -Eo '[0-9]+(\.[0-9]+)+' <<< "$s"
1.3
This might work for you (GNU sed):
sed -n 's/.*[^0-9.]//p' file
The regexp is greedy and swallows the whole line .* then steps back a character at a time till the first match of [^0-9.], removes the front portion and prints the remainder.
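A small sketch of how that greedy match behaves on a few inputs (the second and third sample lines are made up for illustration):

```shell
# Line 1: the last char that is not a digit or dot is the final "-".
# Line 2: the last such char is "v".
# Line 3: has no such char, so the substitution fails and -n prints nothing.
out=$(printf '%s\n' \
        '1012-EPS-Test-OF-Something-1.3' \
        'app-v2.10.4' \
        '123' |
      sed -n 's/.*[^0-9.]//p')
echo "$out"
```

A line that ends in something other than digits and dots, such as a -beta suffix, would be reduced to an empty line, so this approach assumes the version is the last thing on the line.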
You can use string manipulation to get the last part after -:
s='1012-EPS-Test-OF-Something-1.3'
s="${s##*-}"
See this online demo:
#!/bin/bash
s='1012-EPS-Test-OF-Something-1.3'
s="${s##*-}"
echo "$s"
# => 1.3
See 10.1. Manipulating Strings:
${string##substring}
    Deletes longest match of $substring from front of $string.
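For comparison, a short sketch of the four related expansions on the same string (the variable names are illustrative; these forms are available in POSIX shells as well as bash):

```shell
s='1012-EPS-Test-OF-Something-1.3'

longest_front=${s##*-}    # remove longest prefix matching "*-"
shortest_front=${s#*-}    # remove shortest prefix matching "*-"
shortest_back=${s%-*}     # remove shortest suffix matching "-*"
longest_back=${s%%-*}     # remove longest suffix matching "-*"

printf '%s\n' "$longest_front" "$shortest_front" "$shortest_back" "$longest_back"
```

Only the ## form yields the version number here, because it strips through the last hyphen.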

pipe sed command to create multiple files

I need to get X to Y in the file with multiple occurrences, each time it matches an occurrence it will save to a file.
Here is an example file (demo.txt):
\x00START how are you? END\x00
\x00START good thanks END\x00
sometimes random things\x00\x00 inbetween it (ignore this text)
\x00START thats nice END\x00
And now after running a command each file (/folder/demo1.txt, /folder/demo2.txt, etc) should have the contents between \x00START and END\x00 (\x00 is null) in addition to 'START' but not 'END'.
/folder/demo1.txt should say "START how are you? ", /folder/demo2.txt should say "START good thanks".
So basically it should pipe "how are you?" and using 'echo' I can prepend the 'START'.
It's worth keeping in mind that I am dealing with a very large binary file.
I am currently using
sed -n -e '/\x00START/,/END\x00/ p' demo.txt > demo1.txt
but that's not working as expected (it's getting lines before the '\x00START' and doesn't stop at the first 'END\x00').
If you have GNU awk, try:
awk -v RS='\0START|END\0' '
length($0) {printf "START%s\n", $0 > ("folder/demo"++i".txt")}
' demo.txt
RS='\0START|END\0' defines a regular expression acting as the [input] Record Separator, which breaks the input file into records: the strings (byte sequences) between \0START and END\0 (\0 represents NUL (null char.) here).
Using a multi-character, regex-based record separator is NOT POSIX-compliant; GNU awk supports it (as does mawk in general, but seemingly not with NUL chars.).
Pattern length($0) ensures that the associated action ({...}) is only executed if the record is nonempty.
{printf "START%s\n", $0 > ("folder/demo"++i".txt")} outputs each nonempty record, preceded by "START", into file folder/demo{n}.txt, where {n} represents a sequence number starting with 1.
You can use grep for that:
grep -Po "START\s+\K.*?(?=END)" file
how are you?
good thanks
thats nice
Explanation:
-P To allow Perl regex
-o To extract only matched pattern
\K Discards everything matched so far from the reported match (similar in effect to a positive lookbehind)
(?=something) Positive lookahead
EDIT: To match \00 as START and END may appear in between:
echo -e '\00START hi how are you END\00' | grep -aPo '\00START\K.*?(?=END\00)'
hi how are you
EDIT2: The solution using grep would only match single line, for multi-line it's better use perl instead. The syntax will be very similar:
echo -e '\00START hi \n how\n are\n you END\00' | perl -ne 'BEGIN{undef $/ } /\A.*?\00START\K((.|\n)*?)(?=END)/gm; print $1'
hi
how
are
you
What's new here:
undef $/ Undefine INPUT separator $/ which defaults to '\n'
(.|\n)* Dot matches almost any character, but it does not match \n, so we need to add it here.
/gm Modifiers, g for global m for multi-line
I would translate the nulls into newlines so that grep can find your wanted text on a clean line by itself:
tr '\000' '\n' < yourfile.bin | grep "^START"
From there you can pipe it into sed as before.
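As a rough sketch of where the tr idea could lead, the translated stream can be fed to a portable awk that writes one numbered file per record. The temporary directory and sample data are made up for illustration, and the sketch assumes each START ... END record contains no newline of its own:

```shell
dir=$(mktemp -d)

# Sample input: NUL-delimited START ... END records with noise in between.
printf '\000START how are you? END\000\n\000START good thanks END\000\nnoise\000\000 in between\n\000START thats nice END\000\n' > "$dir/demo.txt"

tr '\000' '\n' < "$dir/demo.txt" |
awk -v dir="$dir" '
    /^START/ {
        sub(/END$/, "")                  # keep START, drop the trailing END
        out = dir "/demo" (++n) ".txt"   # demo1.txt, demo2.txt, ...
        print > out
        close(out)                       # do not keep many files open
    }'

first=$(cat "$dir/demo1.txt")
count=$(ls "$dir"/demo[0-9]*.txt | wc -l | tr -d ' ')
echo "files written: $count"
rm -rf "$dir"
```

For records that span several lines, the GNU awk or perl approaches above are a better fit.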