Extract multiple occurrences on the same line using sed/regex


I am trying to loop through each line in a file and extract the names that start with ${ and end with }. As the final output I expect only SOLDIR and TEMP (from inputfile.sh).
I have tried using the following script but it seems it matches and extracts only the second occurrence of the pattern TEMP. I also tried adding g at the end but it doesn't help. Could anybody please let me know how to match and extract both/multiple occurrences on the same line ?
inputfile.sh:
.
.
SOLPORT=\`grep -A 4 '\[LocalDB\]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`
.
.
script.sh:
infile='inputfile.sh'
while read line ; do
echo $line | sed 's%.*${\([^}]*\)}.*%\1%g'
done < "$infile"

May I propose a grep solution?
grep -oP '(?<=\${).*?(?=})'
It uses Perl-style lookaround assertions and lazily matches anything between '${' and '}'.
Feeding your line to it, I get
$ echo "SOLPORT=\`grep -A 4 '[LocalDB]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`" | grep -oP '(?<=\${).*?(?=})'
SOLDIR
TEMP
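Where grep lacks PCRE support (-P), a similar result may be possible with a plain basic regex plus a small sed cleanup. This is only a sketch: the function name extract_vars is made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep do, but it is not in POSIX).

```shell
extract_vars() {
    # grep -o prints each ${...} match on its own line;
    # sed then strips the leading "${" and the trailing "}".
    grep -o '\${[^}]*}' | sed -e 's/^\${//' -e 's/}$//'
}

# Usage: read lines on stdin, print one variable name per line.
printf '%s\n' 'a ${SOLDIR}/x | grep ${TEMP} | awk' | extract_vars
```

Because [^}]* cannot cross a closing brace, no lazy quantifier is needed here.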

This might work for you (but maybe only for your specific input line):
sed 's/[^$]*\(${[^}]\+}\)[^$]*/\1\t/g;s/$[^{$]\+//g'

Extracting multiple matches from a single line using sed isn't as bad as I thought it'd be, but it's still fairly esoteric and difficult to read:
$ echo 'Hello ${var1}, how is your ${var2}' | sed -En '
# Replace ${PREFIX}${TARGET}${SUFFIX} with ${PREFIX}\a${TARGET}\n${SUFFIX}
s#\$\{([^}]+)\}#\a\1\n#
# Continue to next line if no matches.
/\n/!b
# Remove the prefix.
s#.*\a##
# Print up to the first newline.
P
# Delete up to the first newline and reprocess what's left of the line.
D
'
var1
var2
And all on one line:
sed -En 's#\$\{([^}]+)\}#\a\1\n#;/\n/!b;s#.*\a##;P;D'
Since POSIX extended regexes don't support non-greedy quantifiers, and a newline escape can't be put in a bracket expression, I've used a BEL character (\a) as a sentinel at the end of the prefix instead of a newline. A newline could be used, but then the second substitution would have to be the questionable s#.*\n(.*\n.*)#\1#, which might involve a pathological amount of backtracking by the regex engine.
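As a rough check, the one-liner can be wrapped in a function and fed several lines, including one without any match. The function name extract_braced is hypothetical, and the sketch assumes GNU sed (for -E, \a and \n in the replacement).

```shell
extract_braced() {
    # Same commands as the one-liner: mark the match with BEL and newline,
    # skip lines without a match, drop the prefix, then P/D to loop.
    sed -En 's#\$\{([^}]+)\}#\a\1\n#;/\n/!b;s#.*\a##;P;D'
}

printf '%s\n' 'no vars here' \
    'SOLPORT=`grep ${SOLDIR}/solidhac.ini | grep ${TEMP}`' | extract_braced
```

Lines with no ${...} take the /\n/!b branch and, because of -n, print nothing.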

Related

find recurring pattern with `sed`

I am using GNU bash 4.3.48
I expected that
echo "23S62M1I19M2D" | sed 's/.*\([0-9]*M\).*/\1/g'
would output 62M19M... But it doesn't.
sed 's/\([0-9]*M\)//g' deletes ALL [0-9]*M and retrieves 23S1I2D. but the group \1 is not working as I thought it would.
sed 's/.*\([0-9]*M\).*/ \1 /g', retrieves M...
What am I doing wrong?
Thank you!

With your shown samples and with awk you could try following program.
echo "23S62M1I19M2D" |
awk '
{
val=""
while(match($0,/[0-9]+M/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
'
Explanation: echo prints the value and sends it to awk as standard input. In the awk program, a loop uses the match function with the regex /[0-9]+M/ to find every match on the line, appends each one to val, and prints the collected matches once the line has been fully scanned.
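The same loop can be made reusable by passing the regex and a separator in as awk variables. This is a sketch only: collect_matches and its two parameters are names invented here, and the RLENGTH guard is an added assumption to avoid looping forever on a regex that can match the empty string.

```shell
collect_matches() {
    # $1: ERE to search for; $2: separator placed between matches
    awk -v re="$1" -v sep="$2" '{
        out = ""; rest = $0
        # Take the leftmost match, then continue in the remainder.
        while (match(rest, re) && RLENGTH > 0) {
            out = out (out == "" ? "" : sep) substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

echo "23S62M1I19M2D" | collect_matches '[0-9]+M' ','
```

Working on a copy (rest) instead of $0 avoids re-splitting the record on every iteration.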

This might work for you (GNU sed):
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//gp}' file
Surround the match by newlines and then remove non-matching parts.
Alternative, using grep and tr:
grep -o '[0-9]*M' file | tr -d '\n'
N.B. tr removes all newlines, including the last one. To restore the final newline, use:
grep -o '[0-9]*M' file | tr -d '\n' | paste
The alternate solution will concatenate all results into a single line. To achieve the same result with the first solution use:
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//g;H};${x;s/\n//gp}' file

The problem is that the .* is greedy. Since only M is obligatory, the first .* consumes everything up to the last M, so the whole string is matched, only M is captured, and only M is kept after replacing with the \1 backreference.
That means you can't easily do this with sed. It is much easier with Perl, since it supports matching and skipping a pattern:
#!/bin/bash
perl -pe 's/\d+M(*SKIP)(*F)|.//g' <<< "23S62M1I19M2D"
See the online demo. The pattern matches
\d+M(*SKIP)(*F) - one or more digits, M, and then the match is omitted and the next match is searched for from the failure position
|. - or matches any char other than a line break char.
Or simply match all occurrences and concatenate them:
perl -lane 'BEGIN{$a="";} while (/\d+M/g) {$a .= $&} END{print $a;}' <<< "23S62M1I19M2D"
All \d+M matches are appended to the $a variable which is printed at the end of processing the string.

Your substitution is probably working, but not substituting what you think it is.
In the substitution s/\(foo...\)/\1/, the \1 refers to whatever \(...\) matched and captured, so your substitution is replacing foo... by foo...!
% echo "1234ABC" | sed 's/\([A-Z]\)/-\1-/'g
1234-A--B--C-
So you'll need to match more, but capture only a portion of the match. For example:
echo "23S62M1I19M2D" | sed 's/[0-9]*[A-LN-Z]*\([0-9]*M\)/\1/g'
62M19M2D
In the case of sed 's/.*\([0-9]*M\).*/\1/g' (did that appear in an edit to the question, or did I just miss it?), the .* matches ‘greedily’ – it matches as much as it possibly can, thus including the digits before the M. In the example above, the [A-LN-Z] is required to be at the end of the uncaptured part, so the digits are forced to be matched by the [0-9] inside the capture.
Getting a clear idea of what ‘greedy’ means is really important when writing or interpreting regexps.
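To see the greedy behaviour side by side with one way around it, the two commands below can be compared. The function names are made up for this sketch; the second command is the "delete what you don't want" idea, written with [0-9][0-9]* rather than \+ so it does not depend on GNU extensions.

```shell
# Greedy .* swallows the digits, so only the last bare "M" is captured.
greedy_capture() { sed 's/.*\([0-9]*M\).*/\1/'; }

# Deleting every number that is followed by a non-digit, non-M suffix
# leaves just the <digits>M groups.
keep_m_groups() { sed 's/[0-9][0-9]*[^0-9M]//g'; }

echo "23S62M1I19M2D" | greedy_capture   # prints: M
echo "23S62M1I19M2D" | keep_m_groups    # prints: 62M19M
```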

If you know you will only encounter the suffixes S, M, I and D, an alternative approach would be explicitly deleting the combinations you don't want:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[SID]//g'
This gives the expected:
62M19M
Update: This variant produces the same output, but rejects all non-numeric, non-M suffixes:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[^0-9M]//g'

How can I get "grep -zoP" to display every match separately?

I have a file on this form:
X/this is the first match/blabla
X-this is
the second match-
and here we have some fluff.
And I want to extract everything that appears after "X" and between the same markers. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".
So for the given sample file I would like to have this output:
this is the first match
and then
this is
the second match
I managed to get all the content between X followed by a marker by using:
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
That is:
grep -Po '(?<=X(.))(.|\n)+(?=\1)' to match X followed by (something) that gets captured and matched at the end with (?=\1) (I based the code on my answer here).
Note I use (.|\n) to match anything, including a new line, and that I also use -z in grep to match new lines as well.
So this works well, the only problem comes from the display of the output:
$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match
As you can see, all the matches appear together, with "this is the first match" being followed by "this is the second match" with no separator at all. I know this comes from the usage of "-z", which treats the whole file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").
So: is there a way to get all these results separately?
I tried also in GNU Awk:
awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file
but not even the (\n|.*) worked.

awk doesn't support backreferences within regexp definition.
Workarounds:
$ grep -zPo '(?s)(?<=X(.)).+(?=\1)' ip.txt | tr '\0' '\n'
this is the first match
this is
the second match
# with ripgrep, which supports multiline matching
$ rg -NoUP '(?s)(?<=X(.)).+(?=\1)' ip.txt
this is the first match
this is
the second match
Can also use (?s)X(.)\K.+(?=\1) instead of (?s)(?<=X(.)).+(?=\1). Also, you might want to use a non-greedy quantifier here to avoid matching match+xyz+foobaz for the input X+match+xyz+foobaz+.
With perl
$ perl -0777 -nE 'say $& while(/X(.)\K.+(?=\1)/sg)' ip.txt
this is the first match
this is
the second match

Here is another gnu-awk solution making use of RS and RT:
awk -v RS='X.' 'ch != "" && n=index($0, ch) {
print substr($0, 1, n-1)
}
RT {
ch = substr(RT, 2, 1)
}' file
this is the first match
this is
the second match

With GNU awk for multi-char RS, RT, and gensub() and without having to read the whole file into memory:
$ awk -v RS='X.' 'NR>1{print "<" gensub(end".*","",1) ">"} {end=substr(RT,2,1)}' file
<this is the first match>
<this is
the second match>
Obviously I added the "<" and ">" so you could see where each output record starts/ends.
The above assumes that the character after X isn't a non-repetition regexp metachar (e.g. ., ^, [, etc.) so YMMV

The use case is kind of problematic, because as soon as you print the matches, you lose the information about where exactly the separator was. But if that's acceptable, try piping to xargs -r0.
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file | xargs -r0
These options are GNU extensions, but then so is grep -z and (mostly) grep -P, so perhaps that's acceptable.

GNU grep -z terminates input/output records with null characters (useful in conjunction with other tools such as sort -z). pcregrep will not do that:
pcregrep -Mo2 '(?s)X(.)(.+?)\1' file
-onumber used instead of lookarounds. ? lazy quantifier added (in case \1 occurs later).

pipe sed command to create multiple files

I need to get X to Y in the file with multiple occurrences, each time it matches an occurrence it will save to a file.
Here is an example file (demo.txt):
\x00START how are you? END\x00
\x00START good thanks END\x00
sometimes random things\x00\x00 inbetween it (ignore this text)
\x00START thats nice END\x00
And now after running a command each file (/folder/demo1.txt, /folder/demo2.txt, etc) should have the contents between \x00START and END\x00 (\x00 is null) in addition to 'START' but not 'END'.
/folder/demo1.txt should say "START how are you? ", /folder/demo2.txt should say "START good thanks".
So basically it should pipe "how are you?" and using 'echo' I can prepend the 'START'.
It's worth keeping in mind that I am dealing with a very large binary file.
I am currently using
sed -n -e '/\x00START/,/END\x00/ p' demo.txt > demo1.txt
but that's not working as expected (it's getting lines before the '\x00START' and doesn't stop at the first 'END\x00').

If you have GNU awk, try:
awk -v RS='\0START|END\0' '
length($0) {printf "START%s\n", $0 > ("folder/demo"++i".txt")}
' demo.txt
RS='\0START|END\0' defines a regular expression acting as the [input] Record Separator which breaks the input file into records by strings (byte sequences) between \0START and END\0 (\0 represents NUL (null char.) here).
Using a multi-character, regex-based record separator is NOT POSIX-compliant; GNU awk supports it (as does mawk in general, but seemingly not with NUL chars.).
Pattern length($0) ensures that the associated action ({...}) is only executed if the record is nonempty.
{printf "START%s\n", $0 > ("folder/demo"++i".txt")} outputs each nonempty record, preceded by "START", into file folder/demo{n}.txt, where {n} represents a sequence number starting with 1.

You can use grep for that:
grep -Po "START\s+\K.*?(?=END)" file
how are you?
good thanks
thats nice
Explanation:
-P To allow Perl regex
-o To extract only matched pattern
\K Discard everything matched so far from the reported match (works like a lookbehind)
(?=something) Positive lookahead
EDIT: To match \00 as START and END may appear in between:
echo -e '\00START hi how are you END\00' | grep -aPo '\00START\K.*?(?=END\00)'
hi how are you
EDIT2: The solution using grep would only match a single line; for multi-line input it's better to use perl instead. The syntax will be very similar:
echo -e '\00START hi \n how\n are\n you END\00' | perl -ne 'BEGIN{undef $/ } /\A.*?\00START\K((.|\n)*?)(?=END)/gm; print $1'
hi
how
are
you
What's new here:
undef $/ Undefine INPUT separator $/ which defaults to '\n'
(.|\n)* Dot matches almost any character, but it does not match \n, so we need to add it here.
/gm Modifiers, g for global m for multi-line

I would translate the nulls into newlines so that grep can find your wanted text on a clean line by itself:
tr '\000' '\n' < yourfile.bin | grep "^START"
from there you can take it into sed as before.
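One way to carry this idea through to numbered files, sketched under stated assumptions: each record sits on one line once the NULs have become newlines (as in demo.txt), and the output directory already exists. The function name split_records and its directory parameter are invented for illustration.

```shell
split_records() {
    # $1: existing output directory (hypothetical parameter)
    tr '\000' '\n' |
    awk -v dir="$1" '
        /^START.*END$/ {
            sub(/END$/, "")                  # keep START, drop END
            file = dir "/demo" (++n) ".txt"
            print > file
            close(file)                      # avoid running out of open files
        }'
}

# Usage: split_records /folder < demo.txt
```

Records that span several lines would need a different approach, such as the GNU awk RS solution or perl.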

sed regex with multiple matches and condition

I would like to convert strings like:
abc=123.24|127.9|2891;xyz;hgy
to:
abc=123.24,127.9,2891;xyz;hgy
This is close:
echo "abc=123.24|127.9|2891;xyz;hgy" | sed -r 's/(=)([0-9.]+)\|/\1\2,/g'
but returns:
abc=123.24,127.9|2891;xyz;hgy
How can I do the rest of the numbers in a similar fashion if the number of bar-separated numbers is variable?
Clarification:
I hate it when people do not give me the whole picture in questions, but my original description above did just that. The small example is embedded in a much larger line that includes "|" separated text. I want to replace only the "|" with "," between numbers that follow an equal sign. Here is an entire line as an example:
chr1 69511 rs75062661 A G . QSS_ref ASP;BaseCounts=375,3,118,4;CAF=[0.348,0.652];COMMON=1;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1);GNO;HRun=0;HaplotypeScore=0.0000;KGPROD;KGPhase1;LowMQ=0.0280,0.0580,500;MQ=49.32;MQ0=14;MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR,chr1p36;NORMALT=86;NORMREF=228;NSM;NT=het;OTHERKG;QSS=8;QSS_NT=6;REF;RS=75062661;RSPOS=69511;S3D;SAO=0;SGT=AG->AG;SOMATIC;SSR=0;TQSS=1;TQSS_NT=2;TUMALT=15;TUMREF=227;TUMVAF=0.06198347107438017;TUMVARFRACTION=0.1485148514851485;VC=SNV;VLD;VP=0x050200000a05140116000100;WGT=1;dbNSFP_1000Gp1_AC=1424;dbNSFP_1000Gp1_AF=0.652014652014652;dbNSFP_1000Gp1_AFR_AC=162;dbNSFP_1000Gp1_AFR_AF=0.32926829268292684;dbNSFP_1000Gp1_AMR_AC=235;dbNSFP_1000Gp1_AMR_AF=0.649171270718232;dbNSFP_1000Gp1_ASN_AC=500;dbNSFP_1000Gp1_ASN_AF=0.8741258741258742;dbNSFP_1000Gp1_EUR_AC=527;dbNSFP_1000Gp1_EUR_AF=0.6952506596306068;dbNSFP_29way_logOdds=4.1978;dbNSFP_29way_pi=0.1516:0.0:0.6258:0.2226;dbNSFP_ESP6500_AA_AF=0.544101;dbNSFP_ESP6500_EA_AF=0.887429;dbNSFP_Ensembl_geneid=ENSG00000186092;dbNSFP_Ensembl_transcriptid=ENST00000534990|ENST00000335137;dbNSFP_FATHMM_score=0.51;dbNSFP_GERP++_NR=2.31;dbNSFP_GERP++_RS=1.15;dbNSFP_Interpro_domain=GPCR|_rhodopsin-like_superfamily_(1)|;dbNSFP_LRT_Omega=0.000000;dbNSFP_LRT_pred=N;dbNSFP_LRT_score=0.000427;dbNSFP_MutationAssessor_pred=neutral;dbNSFP_MutationAssessor_score=-1.295;dbNSFP_MutationTaster_pred=N;dbNSFP_MutationTaster_score=0.000162;dbNSFP_Polyphen2_HDIV_pred=B;dbNSFP_Polyphen2_HVAR_pred=B;dbNSFP_SIFT_score=0.950000;dbNSFP_Uniprot_aapos=141;dbNSFP_Uniprot_acc=Q8NH21;dbNSFP_Uniprot_id=OR4F5_HUMAN;dbNSFP_aaalt=A;dbNSFP_aapos=189|141;dbNSFP_aaref=T;dbNSFP_cds_strand=+;dbNSFP_codonpos=1;dbNSFP_fold-degenerate=0;dbNSFP_phyloP=0.267000;dbNSFP_refcodon=ACA;dbSNPBuildID=131 AU:CU:DP:FDP:GU:SDP:SUBDP:TU 
228,232:3,3:322:4:86,109:0:0:1,2 227,228:0,0:244:1:15,16:0:0:1,2
The replacement in this line is of the string:
dbNSFP_aapos=189|141
with:
dbNSFP_aapos=189,141

why not:
sed 's/|/,/g'
kent$ echo "abc=123.24|127.9|2891;xyz;hgy"|sed 's/|/,/g'
abc=123.24,127.9,2891;xyz;hgy

Without using perl you can use
input="abc=123;def=hello123test;dbNSFP_aapos=189|141|142;dbNSFP_aaref=T;another=test|hello;"
sed -r 's/(.*=)([0-9.]+\|)+(.*)/\1'$(sed -r 's/(.*=)([0-9.]+\|)+(.*)/\2/' <<< $input | tr '|' ,)'\3/' <<< $input
Output:
abc=123;def=hello123test;dbNSFP_aapos=189,141,142;dbNSFP_aaref=T;another=test|hello;
Replace <<< $input with your file or whatever you actually have as input :)
Explanation:
We have three capturing groups in the regex (I restructured the groups from the OP), the second will contain only the string where the replacement of the | is to happen, while the first and third contain everything before and after the second group.
See the demo at regex101.
Within the second command ($(...)) we grab the second capturing group with sed and replace every | inside with a comma. This substitution is then used to be put in the place of the second group within the other sed-call.

As an alternative, you can try perl and its evaluation flag:
echo "..." | perl -pe 's{=([\d.|]+)}{"=" . (join ",", split /\|/, $1)}eg'
It searches for a string of digits, dots and bars after an equal sign, splits it on | and joins the pieces with commas.

Using tr
echo "abc=123.24|127.9|2891;xyz;hgy" | tr \| ,
abc=123.24,127.9,2891;xyz;hgy

Assuming semicolon is your field separator, how about something like
tr ';\n' '\n;' | sed '/=[0-9.|]*$/s/|/,/g' | tr '\n;' ';\n'
This has a serious flaw; it fails in weird ways for the first and last fields on a line. If you can't live with that, maybe try this:
awk -F ';' -v OFS=';' '{ for(i=1; i<=NF; ++i) if ($i ~ /=([0-9.]+\|)+[0-9.]+$/) gsub(/\|/,",",$i); print }'
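A field-by-field sketch of this approach, wrapped in a function so it can be tried on the sample input from the other answer. The name bars_to_commas is made up. Setting OFS to ";" matters because awk rebuilds the line with OFS whenever a field is modified, and the default OFS is a space.

```shell
bars_to_commas() {
    # FS and OFS are both ";" so that rebuilding $0 after gsub keeps the
    # original separators.
    awk 'BEGIN { FS = OFS = ";" }
    {
        for (i = 1; i <= NF; ++i)
            if ($i ~ /=([0-9.]+[|])+[0-9.]+$/)   # "=" then only |-separated numbers
                gsub(/[|]/, ",", $i)
        print
    }'
}

echo 'abc=123;dbNSFP_aapos=189|141|142;another=test|hello;' | bars_to_commas
```

Fields such as another=test|hello are left alone because the part after = is not purely numeric.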

Best way to complete this Perl regex one-liner

I'm trying to use a Perl one-liner to munge some output from grepping svn diff, so I can automatically test the files. We have a run_test.sh script that can take multiple PHP files prepended with 'Test' as its arguments.
So far I have the following which successfully prepends 'Test' to the file names
[gjempty#gjempty-rhel4 classes]$ svn diff | grep '(revision' | perl -wpl -e 's/(.*)\/(.*)$/$1\/Test$2/'
--- commerce/TestLCart.php (revision 104387)
--- commerce/manufacturing/TestLRoutingData.php (revision 104387)
Now I'd just like to grab the file/path to pass it to our run_test.sh. I can finish it off with awk as below, but am trying to improve my Perl/one-liner skills. So how do I revise the perl one-liner to additionally extract only the file path?
svn diff | grep '(revision' | perl -wpl -e 's/(.*)\/(.*)$/$1\/Test$2/' | awk '{print $2}' | xargs run_test.sh

You're just wanting the file names, so svn st is what you want. Instead of getting large quantities of noise which could potentially contain (revision in it, and the main lines you want, you'll get it like this: M commerce/LCart.php. Then you can just chop off \S* (any number of non-whitespace characters) followed by \s* (any number of whitespace characters), and take what's left. You could do the \S*\s* differently, but that's the simplest way to get all cases.
svn st | perl -wpl -e 's|\S*\s*(.*)/(.*)$|$1/Test$2|'
(Switched it after posting from using s/// to s||| so the / doesn't need to be escaped; good idea, Axeman.)

You can get rid of the grep and the awk fairly easily.
svn diff | perl -wnl -e '/\(revision/ or next; m|(\S+)/(\S+)|; print "$1/Test$2";'
I changed the -p to -n. -p means while (<>) { <your code>; print $_; }, and -n is the same but without the print, since the new version has an explicit print instead.
Rather than an s/// substitution, I used an m// pattern match. I changed the delimiter to | to avoid backslashing the slash (a cause of Leaning Toothpick Syndrome). You can use almost any punctuation character you want.
\S is similar to . but matches only non-whitespace characters. Your .*s in the pattern were actually matching the entire chunks of the line before and after the slash, but the new pattern only matches the pathname of the file. Since the + is "greedy", the first one ($1) will get more string when there are multiple slashes in the pathname, the same as with your substitution pattern.

Better version:
No default print ( -n)
Extract substring first
Subst on that
print value
perl -wnl -e '($_)=m{---\s+(\S+)} and s|/([^/]+)$|/Test$1| and print "$_\n";'
You don't need awk now. And adding '(revision to the expression,
perl -wnl -e '($_)=m{---\s+(\S+)\s+\(revision} and s|/([^/]+)$|/Test$1| and print "$_\n";'
you don't need grep either.
But I have several subversion tools I created, and if all you want are the changed files 'svn st' is better.
svn st | perl -wnle 'm/^[CM]\s+(\S+)/ and $r=rindex($1,"/")+1 and print substr($1,0,$r),"Test",substr($1,$r),"\n"'
This time I chose a rindex + substr method. Now, there's no regex backtracking.
