Extract Number from Constant Output in Bash - regex

I have a script that produces this kind of output stream in an infinite loop:
m 17:24:34|ethminer Speed 377.61 Mh/s gpu/0 29.01 gpu/1 29.91 gpu/2 30.21 gpu/3 28.71 gpu/4 28.11 gpu/5 27.96 gpu/6 28.71 gpu/7 29.01 gpu/8 28.48 gpu/9 28.86 gpu/10 29.91 gpu/11 29.08 gpu/12 29.68 [A1484+5:R0+0:F0] Time: 04:19
I want to extract the integer after "Speed", which is 377 in this case. Suppose the line is stored in a variable named string. So far I have:
echo "$string" | grep -oP '(?<=Speed).*'
I got
377.61 Mh/s gpu/0 29.01 gpu/1 29.91 gpu/2 30.21 gpu/3 28.71 gpu/4 28.11 gpu/5 27.96 gpu/6 28.71 gpu/7 29.01 gpu/8 28.48 gpu/9 28.86 gpu/10 29.91
I want to get rid of the trailing string by executing:
echo "$string" | grep -oP '(?<=Speed).*' | grep -o -E '[1-9][0-9][0-9]*'
but that regular expression is wrong; it doesn't output anything. How can I fix this?
regards

You may use
grep -Po 'Speed\s*\K\d+'
Or, to also get the fractional part if it is necessary
grep -Po 'Speed\s*\K\d+(\.\d+)?'
See the online demo
Details
Speed - a literal substring
\s* - 0+ whitespaces
\K - a match reset operator (discarding all text matched so far from the match value)
\d+ - 1+ digits
(\.\d+)? - an optional sequence of a . and 1+ digits
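Since the output arrives as an endless stream, the same pattern can also be applied one line at a time in plain Bash. This is only a minimal sketch: the helper name extract_speed is made up for illustration, and it uses a capture group in place of \K because Bash's =~ operator works with POSIX ERE rather than PCRE.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the question): print the integer part of the
# number that follows "Speed" in a single line of output.
extract_speed() {
  local line=$1
  # Same idea as Speed\s*\K\d+, written as an ERE with one capture group.
  local re='Speed[[:space:]]*([0-9]+)'
  if [[ $line =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
}

# Sample line shortened from the question.
sample='m 17:24:34|ethminer Speed 377.61 Mh/s gpu/0 29.01 gpu/1 29.91 [A1484+5:R0+0:F0] Time: 04:19'
extract_speed "$sample"   # prints 377
```

In a live pipeline this could be called as your_script | while IFS= read -r line; do extract_speed "$line"; done, where your_script stands for whatever produces the stream. If you stay with grep and pipe its output into another command, GNU grep's --line-buffered option keeps matches from being held back in a buffer.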

If the output is always like that (i.e. no extra lines in between), a simple cut -d' ' -f6 will do the job.

awk 'match($0,"Speed [0-9]+.?[0-9]*"){print substr($0,RSTART+6,RLENGTH-6)}'
sed '/Speed/s/.*Speed \([^ ]*\).*/\1/'
And if every line is formatted the same way, you can do:
awk '{print $6}' file
This assumes that every line has the word Speed in column 5, so the value you want is in column 6.
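If the column position is not guaranteed, one option is to look for the Speed token itself and print whatever field follows it. A sketch, assuming whitespace-separated fields and that only the integer part is wanted (awk's int() truncates the fraction); the sample line is shortened from the question:

```bash
sample='m 17:24:34|ethminer Speed 377.61 Mh/s gpu/0 29.01 gpu/1 29.91 Time: 04:19'

# Scan the fields; when one is exactly "Speed", print the next field as an integer.
speed=$(printf '%s\n' "$sample" |
  awk '{ for (i = 1; i < NF; i++) if ($i == "Speed") { print int($(i + 1)); next } }')

echo "$speed"   # prints 377
```

This does not depend on how many fields come before Speed, at the cost of a loop over the fields of every line.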

Could you please try the following, assuming your Input_file looks like the samples shown.
awk '{sub(/.*Speed /,"");sub(/ .*/,"")} 1' Input_file
In case you want to save the output into Input_file itself, then try the following.
awk '{sub(/.*Speed /,"");sub(/ .*/,"")} 1' Input_file > temp_file && mv temp_file Input_file
Explanation:
awk ' ##awk script starts here.
{
sub(/.*Speed /,"") ##Remove everything from the start of the line up to and including "Speed " in the current line.
sub(/ .*/,"") ##Remove everything from the first remaining space to the end of the current line.
}
1 ##1 is an always-true condition, so awk prints the (edited) current line.
' Input_file ##Input_file name goes here.

sed works too.
$: echo $string | sed -En '/ Speed /{ s/.* Speed ([0-9]+).*/\1/; p; }'
377

Related

Linux extract text between specific strings

I have multiple files with different job names. The job name is specified as follows.
#SBATCH --job-name=01_job1 #Set the job name
I want to use sed/awk/grep to automatically get the name, that is to say, what follows '--job-name=' and precedes the comment '#Set the job name'. For the example above, I want to get 01_job1. The job name could be longer for several files, and there are multiple = signs in following lines in the file.
I have tried using grep -oP "job-name=\s+\K\w+" file and get an empty output. I suspect that this doesn't work because there is no space between 'name=' and '01_job1', so they must be understood as a single word.
I also unsuccessfully tried using awk '{for (I=1;I<NF;I++) if ($I == "name=") print $(I+1)}' file, attempting to find the characters after 'name='.
Lastly, I also unsuccessfully tried sed -e 's/name=\(.*\)#Set/\1/' file to find the characters between 'name=' and the beginning of the comment '#Set'. I receive the whole file as my output when I attempt this.
I appreciate any guidance. Thank you!!
You need to match the whole string with sed and capture just what you need to get, and use -n option with the p flag:
sed -n 's/.*name=\([^[:space:]]*\).*/\1/p'
See the online demo:
#!/bin/bash
s='#SBATCH --job-name=01_job1 #Set the job name'
sed -n 's/.*name=\([^[:space:]]*\).*/\1/p' <<< "$s"
# => 01_job1
Details:
-n - suppresses default line output
.* - any text
name= - a literal name= string
\([^[:space:]]*\) - Group 1 (\1): any zero or more chars other than whitespace
.* - any text
p - print the result of the successful substitution.
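Because the question mentions several files, the same sed expression can be wrapped in a loop. A minimal sketch, assuming one --job-name line per file; the helper name job_name and the two throwaway files are invented for the demonstration:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first --job-name value found in one file.
job_name() {
  sed -n 's/.*--job-name=\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Two throwaway job scripts, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' '#!/bin/bash' '#SBATCH --job-name=01_job1 #Set the job name' '#SBATCH --ntasks=4' > "$tmpdir/a.sh"
printf '%s\n' '#!/bin/bash' '#SBATCH --job-name=02_longer_job_name #Set the job name' > "$tmpdir/b.sh"

# Print "file: name" for every script in the directory.
names=$(for f in "$tmpdir"/*.sh; do
  printf '%s: %s\n' "${f##*/}" "$(job_name "$f")"
done)
printf '%s\n' "$names"
rm -rf "$tmpdir"
```

Lines such as --ntasks=4 contain an = sign but not --job-name=, so the substitution does not fire on them and -n keeps them out of the output.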
Similar to the answer of Gilles Quenot
grep -oP -- '--job-name=\K.*?(?= *# *Set the job name)'
This adds a look-ahead to ensure that the string is followed by #Set the job name. The non-greedy .*? keeps the space before the comment out of the match.
1st solution: In GNU awk, with your shown samples, please try the following awk code.
awk -v RS=' --job-name=\\S+' 'RT && split(RT,arr,"="){print arr[2]}' Input_file
Or, a non-one-liner form of the above GNU awk code would be:
awk -v RS=' --job-name=\\S+' '
RT && split(RT,arr,"="){
print arr[2]
}
' Input_file
2nd solution: Using any awk, please try the following code.
awk -F'[[:space:]]+|--job-name=' '{print $3}' Input_file
3rd solution: Using GNU grep, please try the following code with your shown samples; it uses the non-greedy .*? approach in the regex.
grep -oP '^.*?--job-name=\K\S+' Input_file
Use this; you were close. It is just a corrected version of your grep -oP attempt (the main issue is that you were trying to match a space after the = character):
$ grep -oP -- '--job-name=\K\S+' file
01_job1
The regular expression matches as follows:
job-name= - the literal string 'job-name='
\K - resets the start of the match (what is Kept), as a shorter alternative to using a look-behind assertion: perlmonks look arounds and Support of K in regex
\S+ - non-whitespace (all but \n, \r, \t, \f, and " "), 1 or more times (matching the most amount possible)
You can use a lookbehind and lookahead with GNU grep to get exactly what you describe:
grep -oP '(?<=--job-name=)\S+(?=\s+#Set the job name)' file
Or with awk:
awk '/^#SBATCH[[:space:]]+--job-name=/ &&
/#Set the job name$/ {
sub(/^[^=]*=/,"")
sub(/#[^#]*$/,"")
print
}' file
Or perl:
perl -lnE 'say $1 if /(?<=--job-name=)(\S+)(?=\s+#Set the job name)/' file
Any of these prints:
01_job1

sed - get only text without extension

How do I remove the extension in this SED statement?
Through
sed 's/.* - //'
File content
2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4
Actual
Filename.mp4
Desired
Filename
With your shown samples only, this could be done with simple code in awk, sed and perl as follows.
1st solution: Using sed, perform simple substitutions and you will get desired output.
sed 's/.*- //;s/\.mp4$//' Input_file
2nd solution: Using awk it is even simpler: set up a custom field separator and print the second-to-last field.
awk -F'- |[.]mp4' '{print $(NF-1)}' Input_file
3rd solution: Using substitution method in awk to get the required value as per OP's requirement.
awk '{gsub(/.*- |\.mp4$/,"")} 1' Input_file
4th solution: With a perl one-liner we could grab the needed value by setting the field separators to a dash followed by spaces, or .mp4, as follows:
perl -a -F'-\s+|\.mp4' -ne 'print "$F[$#F-1]\n";' Input_file
The Bash way (which works in most similar shells such as zsh, sh, ksh) is:
fn="2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4"
base=${fn%.*}
ext=${fn#$base.}
echo "$base"
echo "$ext"
Prints:
2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename
mp4
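To end up with just Filename, as the question asks, the same kind of expansions can be chained. A sketch, assuming the separator " - " always precedes the part you want:

```bash
fn='2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4'
name=${fn##* - }   # drop the longest prefix ending in " - "  -> Filename.mp4
name=${name%.*}    # drop the shortest suffix starting at "." -> Filename
echo "$name"       # prints Filename
```

No external process is started, which matters mostly when this runs inside a loop over many names.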
You can use
#!/bin/bash
s='2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4'
sed -n 's/.* - \([^.]*\).*/\1/p' <<< "$s"
# => Filename
See the online demo.
Details:
-n - suppress default line output
s/ - substitute found pattern
.* - \([^.]*\).* - any text, space, -, space, then any zero or more chars other than a dot captured into Group 1, and then any text
/\1/ - replace found matches with Group 1 value
p - print the result of the substitution.
Using GNU awk you can also use a capture group to get the filename
awk 'match($0, /.* - ([^.]+)\.mp4$/, a) {print a[1]}' file
Regex explanation
.* - Match the last occurrence of -
( Capture group 1 (Referred to by a[1] in the awk example)
[^.]+ Match 1+ times any char except a dot
) Close group 1
\.mp4$ Match .mp4 at the end of the string
Awk explanation
awk '
match($0, /.* - ([^.]+)\.mp4$/, a) { # Test if the line using $0 matches the pattern
print a[1] # Print the value of group 1
}
' file
Yet another awk:
awk '{sub(/\.[^.]+$/, ""); print $NF}' file
Filename
gawk/mawk/mawk2 'BEGIN { FS = "( \- |[.][^. ]+$)"
} NF > 2 { print $(NF-1) }'
no substr(), index(), match(), or sub() needed. If you're VERY certain " - " can only occur once, then
awk 'BEGIN { FS = "(^.* \- |[.][^. ]+$)"; OFS = "" } --NF'

Grep value between strings with regex

$ acpi
Battery 0: Charging, 18%, 01:37:09 until charged
How to grep the battery level value without percentage character (18)?
This should do it but I'm getting an empty result:
acpi | grep -e '(?<=, )(.*)(?=%)'
Your regex is correct but will only work with the experimental -P (Perl-mode regex) option in GNU grep. You will also need -o to show only the matching text.
Correct command would be:
grep -oP '(?<=, )\d+(?=%)'
However, if you don't have GNU grep then you can also use sed like this:
sed -nE 's/.*, ([0-9]+)%.*/\1/p' file
18
Could you please try the following, written and tested at https://ideone.com/nzSGKs
your_command | awk 'match($0,/Charging, [0-9]+%/){print substr($0,RSTART+10,RLENGTH-11)}'
Explanation: A detailed explanation of the above.
your_command | ##Run the OP's command and pass its output to awk as standard input.
awk ' ##Start the awk program.
match($0,/Charging, [0-9]+%/){ ##Use the match function to match the regex Charging, [0-9]+% in the line.
print substr($0,RSTART+10,RLENGTH-11) ##Print the substring that starts 10 characters after the start of the match (skipping "Charging, ") and is RLENGTH-11 characters long (dropping the trailing % as well).
}'
Using awk:
awk -F"," '{print $2+0}'
Using GNU sed:
sed -rn 's/.*\, *([0-9]+)\%\,.*/\1/p'
You can use sed:
$ acpi | sed -nE 's/.*Charging, ([[:digit:]]*)%.*/\1/p'
18
Or, if Charging is not always in the string, you can look for the ,:
$ acpi | sed -nE 's/[^,]*, ([[:digit:]]*)%.*/\1/p'
Using bash:
s='Battery 0: Charging, 18%, 01:37:09 until charged'
res="${s#*, }"
res="${res%%%*}"
echo "$res"
Result: 18.
res="${s#*, }" removes text from the beginning up to and including the first comma+space, and "${res%%%*}" removes everything from the first % to the end of the string.
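The difference between the shortest-match (%) and longest-match (%%) forms only shows up when the string contains more than one %. A small sketch with a made-up input string (not real acpi output):

```bash
s='Battery 0: Charging, 18%, rate 50% of max'   # hypothetical string with two % signs
res=${s#*, }            # shortest prefix match: strips "Battery 0: Charging, "
longest=${res%%\%*}     # longest suffix match: cuts at the first % -> 18
shortest=${res%\%*}     # shortest suffix match: cuts at the last % -> 18%, rate 50
echo "$longest"
echo "$shortest"
```

Escaping the pattern's % as \% is optional in the %% form, but in the single-% form it keeps ${res%\%*} from being read as ${res%%*}, which would delete the whole string.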

Sed Regular expression that get data between rx: and [space]

I have this expression that takes everything:
local RX=`sed -e 's#.*rx:\(\)#\1#' <<< "${LINE}"`
The variable LINE has as content:
4: uart:PL011 rev3 mmio:0xC006D000 irq:26 tx:435 rx:0 RTS|DTR
What I want is to return the rx value, in this case 0.
Right now, it is returning everything after "rx:".
How should I do this?
You may extract all digits after rx: using
RX=`sed -e 's#.*rx:\([0-9]*\).*#\1#' <<< "${LINE}"`
See the online demo
I added [0-9]* between \(\) to match 0 or more digits and also a .* pattern at the end of the regex to consume the rest of the line, so that in the output, you could have just the value captured in Group 1.
To match any chars other than whitespace replace [0-9] with [^[:space:]] or [^[:blank:]], or even [^ ].
A better approach:
$ awk -v tag='rx' '{for (i=1;i<=NF;i++){ split($i,t,/:/); f[t[1]]=t[2] } print f[tag]}' <<<"$line"
0
$ awk -v tag='mmio' '{for (i=1;i<=NF;i++){ split($i,t,/:/); f[t[1]]=t[2] } print f[tag]}' <<<"$line"
0xC006D000
$ awk -v tag='uart' '{for (i=1;i<=NF;i++){ split($i,t,/:/); f[t[1]]=t[2] } print f[tag]}' <<<"$line"
PL011
With that, you can print a single value or as many values as you like.
You can change to this:
local RX=`sed -e 's#.*rx:\([^ ]*\).*#\1#' <<< "${LINE}"`
But in this case, if you can use GNU grep then it's better:
local RX=`grep -oP 'rx:\K[^ ]*' <<< "${LINE}"`
Both capture what comes after rx: and before the next space.
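For a single value like this, no external tool is strictly needed. A sketch using only parameter expansion, assuming fields are separated by single spaces; the function name get_rx is invented here, and local is valid because the code runs inside a function:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value that follows "rx:" in a status line.
get_rx() {
  local line=$1 rx
  [[ $line == *rx:* ]] || return 1   # no rx: field at all, report failure
  rx=${line#*rx:}                    # drop everything up to and including the first "rx:"
  rx=${rx%% *}                       # drop everything from the first space onward
  printf '%s\n' "$rx"
}

LINE='4: uart:PL011 rev3 mmio:0xC006D000 irq:26 tx:435 rx:0 RTS|DTR'
get_rx "$LINE"   # prints 0
```

The guard on the first line of the function matters: without it, a line that has no rx: field would pass through the first expansion unchanged and the function would print its first word instead.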
Could you please try the following too.
awk '
match($0,/rx[^ ]*/){
val=substr($0,RSTART,RLENGTH)
sub(/.*:/,"",val)
print val
val=""
}
' Input_file

sed regex with multiple matches and condition

I would like to convert strings like:
abc=123.24|127.9|2891;xyz;hgy
to:
abc=123.24,127.9,2891;xyz;hgy
This is close:
echo "abc=123.24|127.9|2891;xyz;hgy" | sed -r 's/(=)([0-9.]+)\|/\1\2,/g'
but returns:
abc=123.24,127.9|2891;xyz;hgy
How can I do the rest of the numbers in a similar fashion if the number of bar-separated numbers is variable?
Clarification:
I hate it when people do not give me the whole picture in questions, but my original description above did just that. The small example is embedded in a much larger line that includes "|" separated text. I want to replace only the "|" with "," between numbers that follow an equal sign. Here is an entire line as an example:
chr1 69511 rs75062661 A G . QSS_ref ASP;BaseCounts=375,3,118,4;CAF=[0.348,0.652];COMMON=1;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1);GNO;HRun=0;HaplotypeScore=0.0000;KGPROD;KGPhase1;LowMQ=0.0280,0.0580,500;MQ=49.32;MQ0=14;MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR,chr1p36;NORMALT=86;NORMREF=228;NSM;NT=het;OTHERKG;QSS=8;QSS_NT=6;REF;RS=75062661;RSPOS=69511;S3D;SAO=0;SGT=AG->AG;SOMATIC;SSR=0;TQSS=1;TQSS_NT=2;TUMALT=15;TUMREF=227;TUMVAF=0.06198347107438017;TUMVARFRACTION=0.1485148514851485;VC=SNV;VLD;VP=0x050200000a05140116000100;WGT=1;dbNSFP_1000Gp1_AC=1424;dbNSFP_1000Gp1_AF=0.652014652014652;dbNSFP_1000Gp1_AFR_AC=162;dbNSFP_1000Gp1_AFR_AF=0.32926829268292684;dbNSFP_1000Gp1_AMR_AC=235;dbNSFP_1000Gp1_AMR_AF=0.649171270718232;dbNSFP_1000Gp1_ASN_AC=500;dbNSFP_1000Gp1_ASN_AF=0.8741258741258742;dbNSFP_1000Gp1_EUR_AC=527;dbNSFP_1000Gp1_EUR_AF=0.6952506596306068;dbNSFP_29way_logOdds=4.1978;dbNSFP_29way_pi=0.1516:0.0:0.6258:0.2226;dbNSFP_ESP6500_AA_AF=0.544101;dbNSFP_ESP6500_EA_AF=0.887429;dbNSFP_Ensembl_geneid=ENSG00000186092;dbNSFP_Ensembl_transcriptid=ENST00000534990|ENST00000335137;dbNSFP_FATHMM_score=0.51;dbNSFP_GERP++_NR=2.31;dbNSFP_GERP++_RS=1.15;dbNSFP_Interpro_domain=GPCR|_rhodopsin-like_superfamily_(1)|;dbNSFP_LRT_Omega=0.000000;dbNSFP_LRT_pred=N;dbNSFP_LRT_score=0.000427;dbNSFP_MutationAssessor_pred=neutral;dbNSFP_MutationAssessor_score=-1.295;dbNSFP_MutationTaster_pred=N;dbNSFP_MutationTaster_score=0.000162;dbNSFP_Polyphen2_HDIV_pred=B;dbNSFP_Polyphen2_HVAR_pred=B;dbNSFP_SIFT_score=0.950000;dbNSFP_Uniprot_aapos=141;dbNSFP_Uniprot_acc=Q8NH21;dbNSFP_Uniprot_id=OR4F5_HUMAN;dbNSFP_aaalt=A;dbNSFP_aapos=189|141;dbNSFP_aaref=T;dbNSFP_cds_strand=+;dbNSFP_codonpos=1;dbNSFP_fold-degenerate=0;dbNSFP_phyloP=0.267000;dbNSFP_refcodon=ACA;dbSNPBuildID=131 AU:CU:DP:FDP:GU:SDP:SUBDP:TU 
228,232:3,3:322:4:86,109:0:0:1,2 227,228:0,0:244:1:15,16:0:0:1,2
The replacement in this line is of the string:
dbNSFP_aapos=189|141
with:
dbNSFP_aapos=189,141
why not:
sed 's/|/,/g'
kent$ echo "abc=123.24|127.9|2891;xyz;hgy"|sed 's/|/,/g'
abc=123.24,127.9,2891;xyz;hgy
Without using perl you can use
input="abc=123;def=hello123test;dbNSFP_aapos=189|141|142;dbNSFP_aaref=T;another=test|hello;"
sed -r 's/(.*=)([0-9.]+\|)+(.*)/\1'$(sed -r 's/(.*=)([0-9.]+\|)+(.*)/\2/' <<< $input | tr '|' ,)'\3/' <<< $input
Output:
abc=123;def=hello123test;dbNSFP_aapos=189,141,142;dbNSFP_aaref=T;another=test|hello;
Replace <<< $input with your file or whatever you actually have as input :)
Explanation:
We have three capturing groups in the regex (I restructured the groups from the OP), the second will contain only the string where the replacement of the | is to happen, while the first and third contain everything before and after the second group.
See the demo at regex101.
Within the second command ($(...)) we grab the second capturing group with sed and replace every | inside with a comma. This substitution is then used to be put in the place of the second group within the other sed-call.
As an alternative, you can try perl with its evaluation flag:
echo "..." | perl -pe 's{=([\d.|]+)}{"=" . (join ",", split /\|/, $1)}eg'
It searches for a string after an equal sign, splits it on | and joins the pieces with commas.
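If perl is not an option, a sed loop can apply the same restriction by replacing one | per pass until nothing is left to replace. This is a sketch rather than a vetted solution: the helper name bars_to_commas is made up, and it assumes a "number" is any run of digits and dots directly after an = sign.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn | into , only inside runs of numbers that follow "=".
# Each pass converts one |; "ta" jumps back to label a while a substitution succeeded.
bars_to_commas() {
  sed -E -e ':a' \
         -e 's/(=[0-9.]+(,[0-9.]+)*)[|]([0-9.])/\1,\3/' \
         -e 'ta'
}

input='abc=123;def=hello123test;dbNSFP_aapos=189|141|142;dbNSFP_aaref=T;another=test|hello;'
output=$(printf '%s\n' "$input" | bars_to_commas)
echo "$output"
# prints abc=123;def=hello123test;dbNSFP_aapos=189,141,142;dbNSFP_aaref=T;another=test|hello;
```

The label, the substitution and the branch are passed as separate -e expressions so that the script does not rely on GNU sed's handling of semicolons after a label.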
Using tr
echo "abc=123.24|127.9|2891;xyz;hgy" | tr \| ,
abc=123.24,127.9,2891;xyz;hgy
Assuming semicolon is your field separator, how about something like
tr ';\n' '\n;' | sed '/=[0-9.|]*$/s/|/,/g' | tr '\n;' ';\n'
This has a serious flaw; it fails in weird ways for the first and last fields on a line. If you can't live with that, maybe try this:
awk -F ';' -v OFS=';' '{ for(i=1; i<=NF; ++i) if ($i ~ /=([0-9.]+\|)+[0-9.]+$/) gsub(/\|/,",",$i); print }'