sed substitutions within captured group - regex

I think that I'm missing something obvious, but I've spent a lot of time Googling around and searching stackoverflow, and I can't find an answer. I'm looking for a way to perform an additional sed substitution on a captured group. So, for example, if I have the text:
MD5 (./x y.jpg) = 93845ad8b6fb2797d9f1d7e0622e9270
MD5 (./x y z.jpg) = 93845ad8b6fb2797d9f1d7e0622e9270
I'd like to replace the spaces within the parentheses with underscores and reformat the string to be the filenamemd5.
./x_y.jpg 93845ad8b6fb2797d9f1d7e0622e9270
./x_y_z.jpg 93845ad8b6fb2797d9f1d7e0622e9270
I'm able to capture the filename, but I don't know how to perform another substitution on the captured group.
echo 'MD5 (./x y.jpg) = 93845ad8b6fb2797d9f1d7e0622e9270' | sed 's/MD5 (\(.*\)) = \(.*\)/\1 \2/'
outputs:
./x y.jpg 93845ad8b6fb2797d9f1d7e0622e9270
Thoughts?

sed allows you to perform multiple substitutions (s/.../.../) in sequence, with each operating on the result of the previous:
echo 'MD5 (./x y.jpg) = 93845ad8b6fb2797d9f1d7e0622e9270' |
sed 's/MD5 (\(.*\)) = \(.*\)/\1|\2/; s/ /_/g; s/|/ /'
Here I've used a simple trick:
1st s command: I've used a | to temporarily separate the 2 backreferences.
2nd s: I then replace ALL spaces with _ chars.
3rd s: I replace the | with a space, so that a single space remains.
Of course, this trick only works if your input data never contains | chars.

You can use this sed,
sed 's/ /_/g; s/MD5_(\(.*\))_=_\(.*\)/\1 \2/'
Test:
sat:~$ echo 'MD5 (./x y z.jpg) = 93845ad8b6fb2797d9f1d7e0622e9270' |
sed 's/ /_/g; s/MD5_(\(.*\))_=_\(.*\)/\1 \2/'
./x_y_z.jpg 93845ad8b6fb2797d9f1d7e0622e9270

Related

find recurring pattern with `sed`

I am using GNU bash 4.3.48
I expected that
echo "23S62M1I19M2D" | sed 's/.*\([0-9]*M\).*/\1/g'
would output 62M19M... But it doesn't.
sed 's/\([0-9]*M\)//g' deletes ALL [0-9]*M and retrieves 23S1I2D. but the group \1 is not working as I thought it would.
sed 's/.*\([0-9]*M\).*/ \1 /g', retrieves M...
What am I doing wrong?
Thank you!
With your shown samples and with awk you could try following program.
echo "23S62M1I19M2D" |
awk '
{
val=""
while(match($0,/[0-9]+M/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print val
}
'
Explanation: Simple explanation would be, using echo to print values and sending it as a standard input to awk program. In awk program using its match function to match regex mentioned in it(/[0-9]+M) running loop to find all matches in each line and printing the collected matched values at last of each line.
This might work for you (GNU sed):
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//gp}' file
Surround the match by newlines and then remove non-matching parts.
Alternative, using grep and tr:
grep -o '[0-9]*M' file | tr -d '\n'
N.B. tr removes all newlines (including the last one) to restore the last newline, use:
grep -o '[0-9]*M' file | tr -d '\n' | paste
The alternate solution will concatenate all results into a single line. To achieve the same result with the first solution use:
sed -nE '/[0-9]*M/{s//\n&\n/g;s/(^|\n)[^\n]*\n?//g;H};${x;s/\n//gp}' file
The problem is that the .* is greedy. Since only M is obligatory, when the engine finds last M, it satisfies the regex, so all string is matched, M is captured and thus kept after replacing with \1 backreference.
That means, you can't easily do this with sed. You can do that with Perl much easier since it supports matching and skipping pattern:
#!/bin/bash
perl -pe 's/\d+M(*SKIP)(*F)|.//g' <<< "23S62M1I19M2D"
See the online demo. The pattern matches
\d+M(*SKIP)(*F) - one or more digits, M, and then the match is omitted and the next match is searched for from the failure position
|. - or matches any char other than a line break char.
Or simply match all occurrences and concatenate them:
perl -lane 'BEGIN{$a="";} while (/\d+M/g) {$a .= $&} END{print $a;}' <<< "23S62M1I19M2D"
All \d+M matches are appended to the $a variable which is printed at the end of processing the string.
Your substitution is probably working, but not substituting what you think it is.
In the substitution s/\(foo...\)/\1/, the \1 matches whatever \(...\) matches and captures, so your substitution is replacing foo... by foo...!
% echo "1234ABC" | sed 's/\([A-Z]\)/-\1-/'g
1234-A--B--C-
So you'll need to match more, but capture only a portion of the match. For example:
echo "23S62M1I19M2D" | sed 's/[0-9]*[A-LN-Z]*\([0-9]*M\)/\1/g'
62M19M2D
In the case of sed 's/.*\([0-9]*M\).*/\1/g' (did that appear in an edit to the question, or did I just miss it?), the .* matches ‘greedily’ – it matches as much as it possibly can, thus including the digits before the M. In the example above, the [A-LN-Z] is required to be at the end of the uncaptured part, so the digits are forced to be matched by the [0-9] inside the capture.
Getting a clear idea of what ‘greedy’ means is a really important idea when writing or interpreting regexps.
If you know you will only encounter the suffixes S, M, I and D, an alternative approach would be explicitly deleting the combinations you don't want:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[SID]//g'
This gives the expected:
62M19M
Update: This variant produces the same output, but rejects all non-numeric, non-M suffixes:
echo "23S62M1I19M2D" | sed 's/[0-9]\+[^0-9M]//g'

How to extract text between first 2 dashes in the string using sed or grep in shell

I have the string like this feature/test-111-test-test.
I need to extract string till the second dash and change forward slash to dash as well.
I have to do it in Makefile using shell syntax and there for me doesn't work some regular expression which can help or this case
Finally I have to get smth like this:
input - feature/test-111-test-test
output - feature-test-111- or at least feature-test-111
feature/test-111-test-test | grep -oP '\A(?:[^-]++-??){2}' | sed -e 's/\//-/g')
But grep -oP doesn't work in my case. This regexp doesn't work as well - (.*?-.*?)-.*.
Another sed solution using a capture group and regex/pattern iteration (same thing Socowi used):
$ s='feature/test-111-test-test'
$ sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}"
feature-test-111-
Where:
-E - enable extended regex support
s/\//-/ - replace / with -
s/^....*$/ - match start and end of input line
(([^-]-){3}) - capture group #1 that consists of 3 sets of anything not - followed by -
\1 - print just the capture group #1 (this will discard everything else on the line that's not part of the capture group)
To store the result in a variable:
$ url=$(sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}")
$ echo $url
feature-test-111-
You can use awk keeping in mind that in Makefile the $ char in awk command must be doubled:
url=$(shell echo 'feature/test-111-test-test' | awk -F'-' '{gsub(/\//, "-", $$1);print $$1"-"$$2"-"}')
echo "$url"
# => feature-test-111-
See the online demo. Here, -F'-' sets the field delimiter as -, gsub(/\//, "-", $1) replaces / with - in Field 1 and print $1"-"$2"-" prints the value of --separated Field 1 and 2.
Or, with a regex as a field delimiter:
url=$(shell echo 'feature/test-111-test-test' | awk -F'[-/]' '{print $$1"-"$$2"-"$$3"-"}')
echo "$url"
# => feature-test-111-
The -F'[-/]' option sets the field separator to - and /.
The '{print $1"-"$2"-"$3"-"}' part prints the first, second and third value with a separating hyphen.
See the online demo.
To get the nth occurrence of a character C you don't need fancy perl regexes. Instead, build a regex of the form "(anything that isn't C, then C) for n times":
grep -Eo '([^-]*-){2}' | tr / -
With sed and cut
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/'
Output
feature-test-111
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/;s/$/-/'
Output
feature-test-111-
You can use the simple BRE regex form of not something then that something which is [^-]*- to get all characters other than - up to a -.
This works:
echo 'feature/test-111-test-test' | sed -nE 's/^([^/]*)\/([^-]*-[^-]*-).*/\1-\2/p'
feature-test-111-
Another idea using parameter expansions/substitutions:
s='feature/test-111-test-test'
tail="${s//\//-}" # replace '/' with '-'
# split first field from rest of fields ('-' delimited); do this 3x times
head="${tail%%-*}" # pull first field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}" # pull first field; append to previous field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}-" # pull first field; append to previous fields; add trailing '-'
$ echo "${head}"
feature-test-111-
A short sed solution, without extended regular expressions:
sed 's|\(.*\)/\([^-]*-[^-]*\).*|\1-\2|'

sed replace : delete a character between two strings while preserving a regex

I have a csv file which is delimited by #~#.
there is a field which contains 0 and then n(more than 1) number of '.'(dot).
I need to remove the zero and preserve the later dots. I have to also take care that floating numbers are not affected.
So effectively replace #~#0.....#~# to #~#.....#~# (dots can be from 1 to any)
To limit the replacement with fields matching the pattern use this
$ echo "#~#0.12#~#0.....#~#0.1#~#0.#~#" | sed -r 's/#~#0(\.+)#~#/#~#\1#~#/g'
will preserve 0.12 and 0.1 but replace 0..... and 0.
#~#0.12#~#.....#~#0.1#~#.#~#
+ in regex means one or more. Anchoring with the field delimiters will make sure nothing else will be replaced.
Using sed you can do:
s='#~#0.....#~#'
sed -r 's/(^|#~#)0(\.+($|#~#))/\1\2/g' <<< "$s"
#~#.....#~#
sed -r 's/(^|#~#)0(\.+($|#~#))/\1\2/g' <<< "#~#0.00#~#"
#~#0.00#~#
]$ echo "#~#0.....#~#" | sed 's/#0/#/g'
#~#.....#~#
Escape the dots ans include all characters that should match:
echo "#~#0.1234#~#0.....#~#" | sed 's/#~#0\.\./#~#../g'
Using var's will not improve much:
delim="#~#"
echo "#~#0.1234#~#0.....#~#" | sed "s/${delim}0\.\./${delim}../g"

sed regex with multiple matches and condition

I would like to convert strings like:
abc=123.24|127.9|2891;xyz;hgy
to:
abc=123.24,127.9,2891;xyz;hgy
This is close:
echo "abc=123.24|127.9|2891;xyz;hgy" | sed -r 's/(=)([0-9.]+)\|/\1\2,/g'
but returns:
abc=123.24,127.9|2891;xyz;hgy
How can I do the rest of the numbers in a similar fashion if the number of bar-separated numbers is variable?
Clarification:
I hate it when people do not give me the whole picture in questions, but my original description above did just that. The small example is embedded in a much larger line that includes "|" separated text. I want to replace only the "|" with "," between numbers that follow an equal sign. Here is an entire line as an example:
chr1 69511 rs75062661 A G . QSS_ref ASP;BaseCounts=375,3,118,4;CAF=[0.348,0.652];COMMON=1;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1);GNO;HRun=0;HaplotypeScore=0.0000;KGPROD;KGPhase1;LowMQ=0.0280,0.0580,500;MQ=49.32;MQ0=14;MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR,chr1p36;NORMALT=86;NORMREF=228;NSM;NT=het;OTHERKG;QSS=8;QSS_NT=6;REF;RS=75062661;RSPOS=69511;S3D;SAO=0;SGT=AG->AG;SOMATIC;SSR=0;TQSS=1;TQSS_NT=2;TUMALT=15;TUMREF=227;TUMVAF=0.06198347107438017;TUMVARFRACTION=0.1485148514851485;VC=SNV;VLD;VP=0x050200000a05140116000100;WGT=1;dbNSFP_1000Gp1_AC=1424;dbNSFP_1000Gp1_AF=0.652014652014652;dbNSFP_1000Gp1_AFR_AC=162;dbNSFP_1000Gp1_AFR_AF=0.32926829268292684;dbNSFP_1000Gp1_AMR_AC=235;dbNSFP_1000Gp1_AMR_AF=0.649171270718232;dbNSFP_1000Gp1_ASN_AC=500;dbNSFP_1000Gp1_ASN_AF=0.8741258741258742;dbNSFP_1000Gp1_EUR_AC=527;dbNSFP_1000Gp1_EUR_AF=0.6952506596306068;dbNSFP_29way_logOdds=4.1978;dbNSFP_29way_pi=0.1516:0.0:0.6258:0.2226;dbNSFP_ESP6500_AA_AF=0.544101;dbNSFP_ESP6500_EA_AF=0.887429;dbNSFP_Ensembl_geneid=ENSG00000186092;dbNSFP_Ensembl_transcriptid=ENST00000534990|ENST00000335137;dbNSFP_FATHMM_score=0.51;dbNSFP_GERP++_NR=2.31;dbNSFP_GERP++_RS=1.15;dbNSFP_Interpro_domain=GPCR|_rhodopsin-like_superfamily_(1)|;dbNSFP_LRT_Omega=0.000000;dbNSFP_LRT_pred=N;dbNSFP_LRT_score=0.000427;dbNSFP_MutationAssessor_pred=neutral;dbNSFP_MutationAssessor_score=-1.295;dbNSFP_MutationTaster_pred=N;dbNSFP_MutationTaster_score=0.000162;dbNSFP_Polyphen2_HDIV_pred=B;dbNSFP_Polyphen2_HVAR_pred=B;dbNSFP_SIFT_score=0.950000;dbNSFP_Uniprot_aapos=141;dbNSFP_Uniprot_acc=Q8NH21;dbNSFP_Uniprot_id=OR4F5_HUMAN;dbNSFP_aaalt=A;dbNSFP_aapos=189|141;dbNSFP_aaref=T;dbNSFP_cds_strand=+;dbNSFP_codonpos=1;dbNSFP_fold-degenerate=0;dbNSFP_phyloP=0.267000;dbNSFP_refcodon=ACA;dbSNPBuildID=131 AU:CU:DP:FDP:GU:SDP:SUBDP:TU 228,232:3,3:322:4:86,109:0:0:1,2 227,228:0,0:244:1:15,16:0:0:1,2
The replacement in this line is of the string:
dbNSFP_aapos=189|141
with:
dbNSFP_aapos=189,141
why not:
sed 's/|/,/g'
kent$ echo "abc=123.24|127.9|2891;xyz;hgy"|sed 's/|/,/g'
abc=123.24,127.9,2891;xyz;hgy
Without using perl you can use
input="abc=123;def=hello123test;dbNSFP_aapos=189|141|142;dbNSFP_aaref=T;another=test|hello;"
sed -r 's/(.*=)([0-9.]+\|)+(.*)/\1'$(sed -r 's/(.*=)([0-9.]+\|)+(.*)/\2/' <<< $input | tr '|' ,)'\3/' <<< $input
Output:
abc=123;def=hello123test;dbNSFP_aapos=189,141,142;dbNSFP_aaref=T;another=test|hello;
Replace <<< $input with your file or whatever you actually have as input :)
Explanation:
We have three capturing groups in the regex (I restructured the groups from the OP), the second will contain only the string where the replacement of the | is to happen, while the first and third contain everything before and after the second group.
See the demo # regex101.
Within the second command ($(...)) we grab the second capturing group with sed and replace every | inside with a comma. This substitution is then used to be put in the place of the second group within the other sed-call.
As alternative, you can try with perl and its evalutation flag:
echo "..." | perl -pe 's{=([\d.|]+)}{"=" . (join ",", split /\|/, $1)}eg'
It searches for a string after an equal sign, splits it with | and join it with commas.
Using tr
echo "abc=123.24|127.9|2891;xyz;hgy" | tr \| ,
abc=123.24,127.9,2891;xyz;hgy
Assuming semicolon is your field separator, how about something like
tr `;\n' '\n;' | sed '/=[0-9.|]*$/s/|/,/g' | tr '\n;' ';\n'
This has a serious flaw; it fails in weird ways for the first and last fields on a line. If you can't live with that, maybe try this:
awk -F ';' '{ for(i=1; i<=NF; ++i) if ($i ~ /=([0-9.]+\|)+[0-9.]+$/) gsub(/\|/,",",$i); print }'

How to seek forward and replace selected characters with sed

Can I use sed to replace selected characters, for example H => X, 1 => 2, but first seek forward so that characters in first groups are not replaced.
Sample data:
"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";
How it should be after sed:
"Hello World";"Number 1 is there";"tX2s-Xas,2,XXunKnownData";
What I have tried:
Nothing really, I would try but everything I know about sed expressions seems to be wrong.
Ok, I have tried to capture ([^;]+) and "skip" (get em back using ´\1\2´...) first groups separated by ;, this is working fine but then comes problem, if I use capturing I need to select whole group and if I don't use capturing I'll lose data.
This is possible with sed, but is kinda tedious. To do the translation if field number $FIELD you can use the following:
sed 's/\(\([^;]*;\)\{'$((FIELD-1))'\}\)\([^;]*;\)/\1\n\3\n/;h;s/[^\n]*\n\([^\n]*\).*/\1/;y/H1/X2/;G;s/\([^\n]*\)\n\([^\n]*\)\n\([^\n]*\)\n\([^\n]*\)/\2\1\4/'
Or, reducing the number of brackets with GNU sed:
sed -r 's/(([^;]*;){'$((FIELD-1))'})([^;]*;)/\1\n\3\n/;h;s/[^\n]*\n([^\n]*).*/\1/;y/H1/X2/;G;s/([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)/\2\1\4/'
Example:
$ FIELD=3
$ echo '"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";' | sed -r 's/(([^;]*;){'$((FIELD-1))'})([^;]*;)/\1\n\3\n/;h;s/[^\n]*\n([^\n]*).*/\1/;y/H1/X2/;G;s/([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)/\2\1\4/'
"Hello World";"Number 1 is there";"tX2s-Xas,2,XXunKnownData";
$ FIELD=2
$ echo '"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";' | sed -r 's/(([^;]*;){'$((FIELD-1))'})([^;]*;)/\1\n\3\n/;h;s/[^\n]*\n([^\n]*).*/\1/;y/H1/X2/;G;s/([^\n]*)\n([^\n]*)\n([^\n]*)\n([^\n]*)/\2\1\4/'
"Hello World";"Number 2 is there";"tH1s-Has,1,HHunKnownData";
There may be a simpler way that I didn't think of, though.
If awk is ok for you:
awk -F";" '{gsub("H","X",$3);gsub("1","2",$3);}1' OFS=";" file
Using -F, the file is split with semi-colon as delimiter, and hence now the 3rd field($3) is of our interest. gsub function substitutes all occurences of H with X in the 3rd field, and again 1 to 2.
1 is to print every line.
[UPDATE]
(I just realized that it could be shorter. Perl has an auto-split mode):
$F[2] =~ s/H/X/g; $F[2] =~ s/1/2/g; $_=join(";",#F)
Perl is not known for being particularly readable, but in this case I suspect the best you can get with sed might not be as clear as with Perl:
echo '"Hello World";"Number 1 is there";"tH1s-Has,1,HHunKnownData";' |
perl -F';' -ape '$F[2] =~ s/H/X/g; $F[2] =~ s/1/2/g; $_=join(";",#F)'
Taking apart the Perl code:
# your groups are in #F, accessed as $F[$i]
$F[2] =~ s/H/X/g; # Do whatever you want with your chosen (Nth) group.
$F[2] =~ s/1/2/g;
$_ = join(";", #F) # Put them back together.
perl -pe is like sed. (sort of.)
and perl -F';' -ape means use auto-splitting (-a) and set the field separator to ';'. Then your groups are accessible via $F[i] - so it works slightly like awk, too.
So it would also work like perl -F';' -ape '/*your code*/' < inputfile
I know you asked for a sed solution - I often find myself switching to Perl (though I do still like sed) for one-liners.
awk -F";" '{gsub("H","X",$3);gsub("1","2",$3);}1' Your_file
This might work for you (GNU sed):
sed 's/H/X/2g;s/1/2/2g' file
This changes all but the first occurrence of H or 1 to X or 2 respectively
If it's by fields separated by ;'s, use:
sed 's/H[^;]*;/&\n/;h;y/H/X/;H;g;s/\n.*\n//;s/1[^;]*;/&\n/;h;y/1/2/;H;g;s/\n.*\n//' file
This can be mutated to cater for many values, so:
echo -e "H=X\n1=2"|
sed -r 's|(.*)=(.*)|s/\1[^;]*;/\&\\n/;h;y/\1/\2/;H;g;s/\\n.*\\n//|' |
sed -f - file