awk matches regex characters it shouldn't

My awk program does some odd character matching. Could you please explain what's going on or point me to relevant documentation?
Input file
| 29900 | St. James | ...
| 33010 | Boole / Kirk | ...
awk
awk '/\| ([0-9]{5}) \| ([^\|]*)/{print $2 $4}' input-file.txt
Result
29900St.
33010Boole
Why is the first capturing group $1 the leading |? Usually $0 is the entire match and $1 is the first group.
Why does ([^\|]*) stop at . and / instead of reading on? I basically tell it "all characters that are not |" after all.

By default, awk separates columns by whitespace, so for the record
| 29900 | St. James | ...
we have $1="|", $2="29900", $3="|", $4="St.", $5="James", $6="|" and $7="..."
Additionally, unlike Perl, awk does not store the contents of capturing parentheses anywhere (gawk can, though, via the optional array argument of match()).
Seeing as you want to use pipes as separators, I'd suggest:
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=, '$2 ~ /[0-9]{5}/ {print $2,$3}'
29900,St. James
33010,Boole / Kirk
If you're confused about seeing $2 and $3 in there instead of $1 and $2, consider that a field separator, by definition, separates two fields and must have a field before it and after it. The first field separator shows up at the beginning of each line, therefore there must be a field consisting of an empty string before it: $1 will be the empty string.
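As a rough illustration of the field numbering described above, the sketch below lists every field awk sees for the first sample record. The separator is written with [ \t] and a bracketed [|], which is only one assumed spelling of the same idea, and the variable names are made up for the example.

```shell
# Sample record from the question.
line='| 29900 | St. James | ...'

# Split on a pipe with optional surrounding blanks and list each field.
fields=$(printf '%s\n' "$line" |
  awk -F '[ \t]*[|][ \t]*' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

The first line printed is `$1=[]`, the empty field in front of the leading separator; the number and the name follow as `$2` and `$3`.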

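awk doesn't provide a way to access capture groups; it uses $<number> to access fields of the input file. It looks like you could do:
awk -F' *[|] *' '{print $2 $3;}' input-file.txt
(The pipe is written as [|] because some awks process backslash escapes in the -F argument, which would turn \| into a bare | and change the meaning of the regex.)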
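When a piece of the matched text is wanted rather than a whole field, the usual stand-in for a capture group in plain awk is match() together with RSTART and RLENGTH. A minimal sketch, assuming the wanted piece is the first run of five digits (spelled out digit by digit so it does not depend on interval support):

```shell
# Pull the first run of five digits out of a record, without relying on fields.
line='| 33010 | Boole / Kirk | ...'

id=$(printf '%s\n' "$line" |
  awk 'match($0, /[0-9][0-9][0-9][0-9][0-9]/) { print substr($0, RSTART, RLENGTH) }')

printf '%s\n' "$id"
```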

Extract Number from Constant Output in Bash

I have a script that produces this kind of output stream in an infinite loop:
m 17:24:34|ethminer Speed 377.61 Mh/s gpu/0 29.01 gpu/1 29.91 gpu/2 30.21 gpu/3 28.71 gpu/4 28.11 gpu/5 27.96 gpu/6 28.71 gpu/7 29.01 gpu/8 28.48 gpu/9 28.86 gpu/10 29.91 gpu/11 29.08 gpu/12 29.68 [A1484+5:R0+0:F0] Time: 04:19
I want to extract the integer after "Speed", which is 377 in this case. So far I have the following (suppose the string is named string):
$string | grep -oP '(?<=Speed).*'
I got
377.61 Mh/s gpu/0 29.01 gpu/1 29.91 gpu/2 30.21 gpu/3 28.71 gpu/4 28.11 gpu/5 27.96 gpu/6 28.71 gpu/7 29.01 gpu/8 28.48 gpu/9 28.86 gpu/10 29.91
I want to get rid of the trailing string by executing:
$string | grep -oP '(?<=Speed).*' | grep -o -E '[1-9][0-9][0-9]*'
but that regular expression is wrong; it doesn't output anything. How can I fix this?
regards

You may use
grep -Po 'Speed\s*\K\d+'
Or, to also get the fractional part if it is necessary
grep -Po 'Speed\s*\K\d+(\.\d+)?'
Details
Speed - a literal substring
\s* - 0+ whitespaces
\K - a match reset operator (discarding all text matched so far from the match value)
\d+ - 1+ digits
(\.\d+)? - an optional sequence of a . and 1+ digits
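Where grep -P is not available (the \K operator is a PCRE feature), the same "match the label, keep only the digits" idea can be approximated with a basic-regex sed substitution. This is only a sketch: speed_int is a hypothetical helper name, and the sample line is shortened from the question.

```shell
# Hypothetical helper: print the integer part of the number that follows "Speed".
# Lines without "Speed" followed by a digit print nothing (-n plus the p flag).
speed_int() {
  sed -n 's/.*Speed[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

line='m 17:24:34|ethminer Speed 377.61 Mh/s gpu/0 29.01 gpu/1 29.91 [A1484+5:R0+0:F0] Time: 04:19'
result=$(printf '%s\n' "$line" | speed_int)
printf '%s\n' "$result"
```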

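If the output is always like that (i.e. no extra lines in between), a simple cut -d' ' -f6 will do the job, provided the number really is the sixth space-delimited field of your actual output. In the sample as displayed above, with single spaces, it is the fourth, so adjust the field number accordingly.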

awk 'match($0,/Speed [0-9]+\.?[0-9]*/){print substr($0,RSTART+6,RLENGTH-6)}'
sed '/Speed/s/.*Speed \([^ ]*\).*/\1/'
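If each line is always formatted the same way, you can do:
awk '{print $6}' file
This assumes that every line has the word Speed in column 5, so that the value to print is in column 6. Count the whitespace-separated columns in your real output first; in the sample as displayed above, Speed is column 3 and the value is column 4.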

Could you please try the following, assuming your Input_file looks like the samples shown.
awk '{sub(/.*Speed /,"");sub(/ .*/,"")} 1' Input_file
In case you want to save the output into Input_file itself, try the following.
awk '{sub(/.*Speed /,"");sub(/ .*/,"")} 1' Input_file > temp_file && mv temp_file Input_file
Explanation: the same script, annotated.
awk ' ##The awk script starts here.
{
sub(/.*Speed /,"") ##sub replaces everything from the start of the line up to and including "Speed " with the empty string in the current line.
sub(/ .*/,"") ##sub replaces everything from the first remaining space to the end of the line with the empty string.
}
1 ##The pattern 1 is always true, so every line (edited or not) is printed.
' Input_file ##The name of the input file.

sed works too.
$ echo "$string" | sed -En '/ Speed /{ s/.* Speed ([0-9]+).*/\1/; p; }'
377

sed regex cut string after match

I tested a regex on http://regexr.com/ and it works like expected.
How can I run this by using sed?
/^.*?OU=([^,]*)/g
The test string looks like:
mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test,OU=Tese Sites,DC=Test,DC=local;test.local
And the output is:
mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test
So it should cut the string before the second OU= starts.
Thanks

sed is not the best tool for this case when you have to deal with text that contains "columns" and can be split. Here are two possibilities, one with sed and the other with awk:
s="mario.test;Mario Test;Mario;Test;123;+001122334455,CN=Mario Test,OU=AT-Linz,OU=Tese Sites,DC=Test,DC=local;test.local"
echo "$s" | sed 's/OU=/й/' | sed 's/\([^й]*\)й\([^,]*\).*/\1OU=\2/'
echo "$s" | awk -F",OU=" '{print $1 ",OU=" $2}'
The awk solution splits with ,OU= substring and then joins the first and second column with the separator (since it is hardcoded, it is easy to put it back).
sed uses 2 passes: 1) replace the first OU= with a character that does not occur in the data (ideally a control character; a Cyrillic letter is used here for better "visibility") to mark the border of our match, 2) match the whole line, capturing the parts we need to keep with capturing groups and putting them back with backreferences.
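For comparison, here is a sketch of a single-pass variant that needs no marker character. It relies on sed finding the leftmost match, so OU= in the pattern lands on the first OU= in the line; the test string is the one from the question, and the variable names are made up for the example.

```shell
s='mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test,OU=Tese Sites,DC=Test,DC=local;test.local'

# Keep the first "OU=..." up to (not including) the next comma, drop the rest.
# Text before the match is left untouched because it is not part of the match.
trimmed=$(printf '%s\n' "$s" | sed 's/\(OU=[^,]*\).*/\1/')

printf '%s\n' "$trimmed"
```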

Your question isn't clear but from reading your comments, are either of these what you're looking for?
$ awk -F, '{print $1 FS $2}' file
mario.test;Mario Test;Mario;Test;123;+001122334455;CN=Mario Test,OU=AT-Test
$ awk -F'CN=[^,]+,OU=|,' '{print $1 $2}' file
mario.test;Mario Test;Mario;Test;123;+001122334455;AT-Test

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular pattern in a string. What is the best way to do it in Unix? What is the most elegant and simple method to achieve this: sed, awk, or just Unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match zero occurrences of the pattern, so zero or one occurrence should be ignored and from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After

Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
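A quick check of the edge cases from the question, as a sketch: without the -s option, cut passes lines that contain no delimiter through unchanged, so inputs with zero or one dash come out as they went in.

```shell
# Three inputs: several dashes, exactly one dash, no dash at all.
out=$(printf '%s\n' 'After-u-math-how-however' 'After-u' 'After' | cut -d'-' -f1,2)

printf '%s\n' "$out"
```

This prints After-u, After-u and After, one per line.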

This might work for you (GNU sed):
sed 's/-[^-]*//2g' file

You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u

@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one, but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1",1)}1'
After-u
Can be done with non-GNU sed using \(...\) for grouping and [^-][^-]* in place of [^-]+, and with non-GNU awk using match() and substr() if necessary.
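The portable variants mentioned above might look roughly like this. It is a sketch rather than the answerer's own code: the sed command uses only basic regular expressions, and the awk command keeps the line up to the end of the match reported by match().

```shell
input=$(printf '%s\n' 'After-u-math-how-however' 'After')

# Basic regular expressions: \( \) for grouping, [^-][^-]* instead of [^-]+.
sed_out=$(printf '%s\n' "$input" | sed 's/\([^-][^-]*-[^-]*\).*/\1/')

# POSIX awk: match() sets RSTART and RLENGTH; keep the line up to the end of the match.
awk_out=$(printf '%s\n' "$input" |
  awk 'match($0, /[^-]+-[^-]*/) { $0 = substr($0, 1, RSTART + RLENGTH - 1) } 1')

printf '%s\n' "$sed_out" "$awk_out"
```

Lines without a dash do not match in either command and are printed unchanged.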

awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
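The same logic extends to "keep the first N fields" with a small loop. The function below is a hypothetical helper, not part of the answer above, and it assumes a single-character separator other than a space.

```shell
# Hypothetical helper: keep_fields SEP N prints the first N SEP-separated
# fields of each input line, rejoined with SEP.
keep_fields() {
  awk -F "$1" -v n="$2" '{
    out = $1
    for (i = 2; i <= n && i <= NF; i++) out = out FS $i
    print out
  }'
}

two=$(printf '%s\n' 'After-u-math-how-however' 'After' | keep_fields - 2)
three=$(printf '%s\n' 'After-u-math-how-however' | keep_fields - 3)
printf '%s\n' "$two" "$three"
```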

This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
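Note that the session above changes IFS for the rest of the shell. A sketch of a fork-free alternative that leaves IFS alone uses only POSIX parameter expansion; trim_after_second is a made-up name for the example.

```shell
# Hypothetical helper: print the argument cut off before its second "-".
trim_after_second() {
  s=$1
  case $s in
    *-*-*)
      first=${s%%-*}    # text before the first dash
      rest=${s#*-}      # text after the first dash
      printf '%s-%s\n' "$first" "${rest%%-*}"
      ;;
    *)
      # Zero or one dash: nothing to remove.
      printf '%s\n' "$s"
      ;;
  esac
}

a=$(trim_after_second 'After-u-math-how-however')
b=$(trim_after_second 'After-u')
c=$(trim_after_second 'After')
printf '%s\n' "$a" "$b" "$c"
```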

awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After

This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2 && i<=NF; i++) printf "-%s",$i; print ""}'

sed regex with multiple matches and condition

I would like to convert strings like:
abc=123.24|127.9|2891;xyz;hgy
to:
abc=123.24,127.9,2891;xyz;hgy
This is close:
echo "abc=123.24|127.9|2891;xyz;hgy" | sed -r 's/(=)([0-9.]+)\|/\1\2,/g'
but returns:
abc=123.24,127.9|2891;xyz;hgy
How can I do the rest of the numbers in a similar fashion if the number of bar-separated numbers is variable?
Clarification:
I hate it when people do not give me the whole picture in questions, but my original description above did just that. The small example is embedded in a much larger line that includes "|" separated text. I want to replace only the "|" with "," between numbers that follow an equal sign. Here is an entire line as an example:
chr1 69511 rs75062661 A G . QSS_ref ASP;BaseCounts=375,3,118,4;CAF=[0.348,0.652];COMMON=1;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1);GNO;HRun=0;HaplotypeScore=0.0000;KGPROD;KGPhase1;LowMQ=0.0280,0.0580,500;MQ=49.32;MQ0=14;MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR,chr1p36;NORMALT=86;NORMREF=228;NSM;NT=het;OTHERKG;QSS=8;QSS_NT=6;REF;RS=75062661;RSPOS=69511;S3D;SAO=0;SGT=AG->AG;SOMATIC;SSR=0;TQSS=1;TQSS_NT=2;TUMALT=15;TUMREF=227;TUMVAF=0.06198347107438017;TUMVARFRACTION=0.1485148514851485;VC=SNV;VLD;VP=0x050200000a05140116000100;WGT=1;dbNSFP_1000Gp1_AC=1424;dbNSFP_1000Gp1_AF=0.652014652014652;dbNSFP_1000Gp1_AFR_AC=162;dbNSFP_1000Gp1_AFR_AF=0.32926829268292684;dbNSFP_1000Gp1_AMR_AC=235;dbNSFP_1000Gp1_AMR_AF=0.649171270718232;dbNSFP_1000Gp1_ASN_AC=500;dbNSFP_1000Gp1_ASN_AF=0.8741258741258742;dbNSFP_1000Gp1_EUR_AC=527;dbNSFP_1000Gp1_EUR_AF=0.6952506596306068;dbNSFP_29way_logOdds=4.1978;dbNSFP_29way_pi=0.1516:0.0:0.6258:0.2226;dbNSFP_ESP6500_AA_AF=0.544101;dbNSFP_ESP6500_EA_AF=0.887429;dbNSFP_Ensembl_geneid=ENSG00000186092;dbNSFP_Ensembl_transcriptid=ENST00000534990|ENST00000335137;dbNSFP_FATHMM_score=0.51;dbNSFP_GERP++_NR=2.31;dbNSFP_GERP++_RS=1.15;dbNSFP_Interpro_domain=GPCR|_rhodopsin-like_superfamily_(1)|;dbNSFP_LRT_Omega=0.000000;dbNSFP_LRT_pred=N;dbNSFP_LRT_score=0.000427;dbNSFP_MutationAssessor_pred=neutral;dbNSFP_MutationAssessor_score=-1.295;dbNSFP_MutationTaster_pred=N;dbNSFP_MutationTaster_score=0.000162;dbNSFP_Polyphen2_HDIV_pred=B;dbNSFP_Polyphen2_HVAR_pred=B;dbNSFP_SIFT_score=0.950000;dbNSFP_Uniprot_aapos=141;dbNSFP_Uniprot_acc=Q8NH21;dbNSFP_Uniprot_id=OR4F5_HUMAN;dbNSFP_aaalt=A;dbNSFP_aapos=189|141;dbNSFP_aaref=T;dbNSFP_cds_strand=+;dbNSFP_codonpos=1;dbNSFP_fold-degenerate=0;dbNSFP_phyloP=0.267000;dbNSFP_refcodon=ACA;dbSNPBuildID=131 AU:CU:DP:FDP:GU:SDP:SUBDP:TU 
228,232:3,3:322:4:86,109:0:0:1,2 227,228:0,0:244:1:15,16:0:0:1,2
The replacement in this line is of the string:
dbNSFP_aapos=189|141
with:
dbNSFP_aapos=189,141

why not:
sed 's/|/,/g'
kent$ echo "abc=123.24|127.9|2891;xyz;hgy"|sed 's/|/,/g'
abc=123.24,127.9,2891;xyz;hgy

Without using perl you can use
input="abc=123;def=hello123test;dbNSFP_aapos=189|141|142;dbNSFP_aaref=T;another=test|hello;"
sed -r 's/(.*=)(([0-9.]+\|)+)(.*)/\1'"$(sed -r 's/(.*=)(([0-9.]+\|)+)(.*)/\2/' <<< "$input" | tr '|' ,)"'\4/' <<< "$input"
Output:
abc=123;def=hello123test;dbNSFP_aapos=189,141,142;dbNSFP_aaref=T;another=test|hello;
Replace <<< "$input" with your file or whatever you actually have as input :)
Explanation:
The regex has four capturing groups (I restructured the groups from the OP). The second wraps the whole run of number-and-| pairs, which is where the replacement of the | is to happen; the third is nested inside it and only serves the repetition. The first and fourth contain everything before and after the second group. The outer group is needed because a repeated group such as ([0-9.]+\|)+ only remembers its last repetition.
Within the second command ($(...)) we grab the second capturing group with sed and replace every | inside with a comma. The result is then put in the place of the second group within the outer sed call. Because (.*=) is greedy, this handles one such field per line.

As an alternative, you can try perl and its evaluation flag:
echo "..." | perl -pe 's{=([\d.|]+)}{"=" . (join ",", split /\|/, $1)}eg'
It searches for a string of digits, dots and | after an equal sign, splits it on | and joins the pieces with commas.

Using tr
echo "abc=123.24|127.9|2891;xyz;hgy" | tr \| ,
abc=123.24,127.9,2891;xyz;hgy

Assuming semicolon is your field separator, how about something like
tr ';\n' '\n;' | sed '/=[0-9.|]*$/s/|/,/g' | tr '\n;' ';\n'
This has a serious flaw; it fails in weird ways for the first and last fields on a line. If you can't live with that, maybe try this:
awk 'BEGIN { FS = OFS = ";" } { for(i=1; i<=NF; ++i) if ($i ~ /=([0-9.]+\|)+[0-9.]+$/) gsub(/\|/,",",$i); print }'
OFS is set as well because awk rebuilds the record with the output separator once a field has been modified.
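To see the per-field approach on a small input, here is a sketch with a made-up line that mixes numeric and non-numeric | lists. It assumes semicolon is both the input and the output separator and uses a slightly simpler test than the answer (only digits, dots and | after the equal sign).

```shell
# Made-up sample: only abc= and dbNSFP_aapos= hold |-separated numbers.
input='abc=123.24|127.9|2891;EFF=X(A|B);dbNSFP_aapos=189|141;id=ENST1|ENST2'

result=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = OFS = ";" }
       {
         # Rewrite | to , only in fields whose value is purely numeric.
         for (i = 1; i <= NF; i++)
           if ($i ~ /=[0-9.|]+$/) gsub(/[|]/, ",", $i)
         print
       }')

printf '%s\n' "$result"
```

The EFF= and id= fields keep their | characters because their values contain letters.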

Extract multiple occurrences on the same line using sed/regex

I am trying to loop through each line in a file and find and extract the names that are enclosed in ${ and }. So as the final output I am expecting only SOLDIR and TEMP (from inputfile.sh).
I have tried using the following script, but it seems to match and extract only the second occurrence of the pattern, TEMP. I also tried adding g at the end, but it doesn't help. Could anybody please let me know how to match and extract both/multiple occurrences on the same line?
inputfile.sh:
.
.
SOLPORT=\`grep -A 4 '\[LocalDB\]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`
.
.
script.sh:
infile='inputfile.sh'
while read line ; do
echo $line | sed 's%.*${\([^}]*\)}.*%\1%g'
done < "$infile"

May I propose a grep solution?
grep -oP '(?<=\${).*?(?=})'
It uses Perl-style lookaround assertions and lazily matches anything between '${' and '}'.
Feeding your line to it, I get
$ echo "SOLPORT=\`grep -A 4 '[LocalDB]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`" | grep -oP '(?<=\${).*?(?=})'
SOLDIR
TEMP
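If grep -P is not available, a loop over match() in plain awk gives a similar result. This is a sketch under the assumption that names never contain a closing brace; the sample line is a simplified version of the one in the question.

```shell
line='SOLPORT=`grep -A 4 "[LocalDB]" ${SOLDIR}/solidhac.ini | grep ${TEMP}`'

names=$(printf '%s\n' "$line" |
  awk '{
    # Repeatedly find the next ${...}, print its inside, then drop what was scanned.
    while (match($0, /[$][{][^}]+[}]/)) {
      print substr($0, RSTART + 2, RLENGTH - 3)
      $0 = substr($0, RSTART + RLENGTH)
    }
  }')

printf '%s\n' "$names"
```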

This might work for you (but maybe only for your specific input line):
sed 's/[^$]*\(${[^}]\+}\)[^$]*/\1\t/g;s/$[^{$]\+//g'

Extracting multiple matches from a single line using sed isn't as bad as I thought it'd be, but it's still fairly esoteric and difficult to read:
$ echo 'Hello ${var1}, how is your ${var2}' | sed -En '
# Replace ${PREFIX}${TARGET}${SUFFIX} with ${PREFIX}\a${TARGET}\n${SUFFIX}
s#\$\{([^}]+)\}#\a\1\n#
# Continue to next line if no matches.
/\n/!b
# Remove the prefix.
s#.*\a##
# Print up to the first newline.
P
# Delete up to the first newline and reprocess what's left of the line.
D
'
var1
var2
And all on one line:
sed -En 's#\$\{([^}]+)\}#\a\1\n#;/\n/!b;s#.*\a##;P;D'
Since POSIX extended regexes don't support non-greedy quantifiers or putting a newline escape in a bracket expression, I've used a BEL character (\a) as a sentinel at the end of the prefix instead of a newline. A newline could be used, but then the second substitution would have to be the questionable s#.*\n(.*\n.*)#\1#, which might involve a pathological amount of backtracking by the regex engine.
