I have a file with content like this:
"one","two","three"" four","five"
I want to remove the stray quote marks inside the double-quoted fields, so that the output looks like this:
"one","two","three four ","five"
How can I do that with awk and a regular expression on Ubuntu? Thanks...
You can simply look for "" and replace it with an empty string (add the g flag to replace every occurrence on a line, not just the first).
Like:
sed -i 's/""//' *.txt
For example:
echo '"one","two","three"" four","five"' | sed 's/""//'
"one","two","three four","five"
sed is the right tool for this.
$ echo '"one","two","three"" four","five"' | sed 's/\([^,]\)"\+\([^,]\)/\1\2/g'
"one","two","three four","five"
The regex above captures the non-comma characters that sit immediately before and after a run of one or more double quotes, so it matches only double quotes in the middle of a field.
OR
$ echo '"one","two","three"" four","five"' | sed -r 's/([^,])"+([^,])/\1\2/g'
"one","two","three four","five"
[^,] matches any character except a comma.
([^,]) captures the matched character into group 1, which acts like a temporary storage area.
"+ matches one or more double quotes.
([^,]) captures the following character, which cannot be a comma, into group 2.
\1\2 replaces the whole match with the characters stored in group 1 and group 2, which drops the quotes between them.
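A single /g pass cannot reuse a character that an earlier match already consumed, so a field with several stray quotes close together may need more than one pass. A minimal sketch that repeats the same substitution until nothing changes is shown here; the function name strip_inner_quotes is made up for illustration, and it assumes that fields contain no commas of their own:

```shell
# Hypothetical helper; assumes comma-separated fields with no embedded commas.
strip_inner_quotes() {
  # ":a" sets a label; "ta" jumps back to it whenever the s command
  # made a replacement, so the loop ends once no inner quote is left.
  sed -E -e ':a' -e 's/([^,])"+([^,])/\1\2/' -e 'ta'
}

printf '%s\n' '"a","b"c"d"e","f"' | strip_inner_quotes
# prints "a","bcde","f"
```

The quotes that open and close a field are left alone because each of them has a comma, or the start or end of the line, on one side.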
Update:
$ echo '"one","two","three" vg " "gfh" four","five"' | sed -r 's/([^,])"+([^,])/\1\2/g;s/([^,])"+([^,])/\1\2/g'
"one","two","three vg gfh four","five"
Using awk you can do:
s='"one","two","three"" four","five"'
awk 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) gsub(/""/, "", $i)} 1' <<< "$s"
"one","two","three four","five"
Related
I'm converting a double-quoted CSV to a pipe-delimited txt file in Unix.
I have used the following sed command to replace "," with | and then remove the starting and ending double quotes.
sed -e 's/","/|/g' -e 's/"//g' filenm.csv > filenm.txt
But the file has runs of consecutive commas that are not surrounded by double quotes, and those are not being replaced.
Col1|col2|col3|col4|col5|col6|col7|col8
Val1|val2|val3,,,,val7|val8
Now I want to convert all these consecutive commas to consecutive pipes, as they indicate empty or null fields.
Other fields also have commas inside their values, and those should not be altered.
I tried the command below for that, but it does not work.
sed -e 's/,{1,\}/|{1,\}/g' filenm.csv > filenm.txt
sample csv file opened in notepad:
"ID","Name","DOB","Age","Address","City","State","Country","Phone number"
"123","ABC","12/20/2020","15","No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
"456","DEF","12/20/2020",,,,,"test-country","9999999999"
"465","XYZ",,,"No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
I hope this helps to reproduce and resolve the issue.
Thanks in advance....
This might work for you (GNU sed):
sed -E ':a;s/^(("[^",]*",+)*"[^",]*),/\1\n/;ta;y/,\n/|,/' file
Iteratively replace ,'s between "'s with newlines, then translate ,'s for |'s and newlines for ,'s.
You can use perl:
perl -pe 's/"([^"]*)"|,/defined($1) ? $1 : "|"/ge' filenm.csv > filenm.txt
Details:
"([^"]*)"|, - the regex pattern that matches ", then captures into Group 1 any zero or more chars other than " and then matches a ", or just matches a , in all other contexts
defined($1) ? $1 : "|" - RHS, replacement, that replaces the match either with Group 1 value (if Group 1 was matched) or with a | (if the , was matched)
ge - g stands for global (replaces all occurrences) and e makes Perl treat the RHS as a Perl expression.
See an online test:
#!/bin/bash
s='"ID","Name","DOB","Age","Address","City","State","Country","Phone number"
"123","ABC","12/20/2020","0","No.38,3rd st, RRR NNN, TRT",,,,"9999999999"'
perl -pe 's/"([^"]*)"|,/defined($1) ? $1 : "|"/ge' <<< "$s"
Output:
ID|Name|DOB|Age|Address|City|State|Country|Phone number
123|ABC|12/20/2020|0|No.38,3rd st, RRR NNN, TRT||||9999999999
Using awk:
awk -F \" '{ for(i=1;i<=NF;i++) { if ($i ~ /^[,]{2,}$/) { $i="," } } OFS="\"";gsub("\",\"","\"|\"",$0)}1' sample.csv
Explanation:
awk -F \" '{ # Set the field delimiter to double quote
for(i=1;i<=NF;i++) {
if ($i ~ /^[,]{2,}$/) {
$i="," # If the field consists solely of 2 or more commas, set it to a single comma
}
}
OFS="\"";
gsub("\",\"","\"|\"",$0) # Substitute "," for "|"
}1' sample.csv
I would use GNU AWK for this in the following way. Let the content of file.txt be
"ID","Name","DOB","Age","Address","City","State","Country","Phone number"
"123","ABC","12/20/2020","15","No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
"456","DEF","12/20/2020",,,,,"test-country","9999999999"
"465","XYZ",,,"No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
then
awk 'BEGIN{FS="\"";OFS=""}{for(i=1;i<=NF;i+=2){$i=gensub(/,/,"|","g",$i)};print $0}' file.txt
output
ID|Name|DOB|Age|Address|City|State|Country|Phone number
123|ABC|12/20/2020|15|No.38,3rd st, RRR NNN, TRT||||9999999999
456|DEF|12/20/2020|||||test-country|9999999999
465|XYZ|||No.38,3rd st, RRR NNN, TRT||||9999999999
I assumed that the first and last columns are never empty. I use " as the field separator, and then in every odd field (these contain only commas) I change every , to |. Finally I print the whole altered line.
(tested in GNU Awk 5.0.1)
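The same idea can be sketched without gensub, which is a GNU extension, by calling gsub directly on the odd fields. This is only an illustration under the same assumptions (balanced quotes, no escaped quotes inside a value), and the function name csv_to_pipes is made up:

```shell
# Hypothetical helper; assumes balanced quotes and no escaped quotes in values.
csv_to_pipes() {
  awk 'BEGIN { FS = "\""; OFS = "" }
  {
    # Splitting on the quote puts the text outside quotes (only commas)
    # into the odd fields and the quoted values into the even fields.
    for (i = 1; i <= NF; i += 2)
      gsub(/,/, "|", $i)
    $1 = $1   # force the record to be rebuilt with the empty OFS
    print
  }' "$@"
}

printf '%s\n' '"465","XYZ",,,"No.38,3rd st, RRR NNN, TRT",,,,"9999999999"' | csv_to_pipes
# prints 465|XYZ|||No.38,3rd st, RRR NNN, TRT||||9999999999
```

The $1 = $1 assignment is there so that the quotes are dropped even on a line where gsub finds nothing to replace.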
I am doing something like this:
echo 'foo_bar_baz=foo_bar_baz' | sed -r 's/_([[:alnum:]])/\U\1/g'
and getting as result:
fooBarBaz=fooBarBaz
Is there a way of getting fooBarBaz=foo_bar_baz instead?
I tried a non-greedy match:
echo 'foo_bar_baz=foo_bar_baz' | sed -r 's/([^=].*?)_([[:alnum:]])/\1\U\2/g'
but the result is this:
foo_bar_baz=foo_barBaz
What I need is to convert from:
foo_bar_baz=foo_bar_baz
to:
fooBarBaz=foo_bar_baz
EDIT: Adding a more generic solution, which also works for more than 3 values before the =.
awk '
BEGIN{
FS=OFS="="
}
{
num=split($1,array,"_")
for(i=2;i<=num;i++){
val=(val?val:"")toupper(substr(array[i],1,1)) substr(array[i],2)
}
$1=array[1] val
val=""
}
1
' Input_file
This should be an easy task for awk.
echo 'foo_bar_baz=foo_bar_baz' | awk '
BEGIN{
FS=OFS="="
}
{
split($1,array,"_")
$1=array[1] toupper(substr(array[2],1,1)) substr(array[2],2) toupper(substr(array[3],1,1)) substr(array[3],2)
}
1'
To simply remove the _ in the first part (this will not capitalize the letters), use:
echo 'foo_bar_baz=foo_bar_baz' | awk 'BEGIN{FS=OFS="="}{gsub(/_/,"",$1)} 1'
You may use
s='foo_bar_baz=foo_bar_baz'
sed -E ':a;s/^([^=_]*)_([[:alnum:]])/\1\U\2/g; ta' <<< "$s"
# => fooBarBaz=foo_bar_baz
See the online sed demo
Details
:a - define an a label to jump to if the substitution is a success
s/^([^=_]*)_([[:alnum:]])/\1\U\2/g - find
^ - start of string
([^=_]*) - Group 1 (\1 in the replacement pattern): any 0+ chars other than = and _
_ - an underscore
([[:alnum:]]) - Group 2 (\2 in the replacement pattern): an alphanumeric char
\1\U\2 - Group 1 value and then an uppercased Group 2 value
ta - t is a branch command making sed go back to the a label and repeat matching.
This might work for you (GNU sed):
sed -E 'h;s/_(.)/\u\1/g;G;s/=.*=/=/' file
Make a copy of the current line. Remove all _'s and uppercase the following characters. Append the copy and replace everything between ='s with a single =.
An alternative:
sed -E ':a;s/_(.*=)/\u\1/;ta' file
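The hold-space version above can be spelled out one command per line. This is a sketch that assumes GNU sed (for \u) and exactly one = per line; the function name camel_case_key is made up for illustration:

```shell
# Hypothetical helper; assumes GNU sed and a single "=" on each line.
camel_case_key() {
  sed -E '
    # save an untouched copy of the line in the hold space
    h
    # camel-case the whole line, key and value alike
    s/_(.)/\u\1/g
    # append a newline plus the saved copy to the pattern space
    G
    # delete from the first = through the last =, which removes the
    # camel-cased value and the untouched key in one go
    s/=.*=/=/
  ' "$@"
}

printf '%s\n' 'foo_bar_baz=foo_bar_baz' | camel_case_key
# prints fooBarBaz=foo_bar_baz
```

The last substitution works because in GNU sed the . also matches the newline that G inserted.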
With GNU awk for the 3rd arg to match():
$ echo 'foo_bar_baz=foo_bar_baz' |
awk '{while (match($0,/(.*)_(.)(.*=.*)/,a)) $0 = a[1] toupper(a[2]) a[3]} 1'
fooBarBaz=foo_bar_baz
Note that the above solution is not restricted to any specific number of _s nor any specific letter following the underscores:
$ echo 'wee_sleekit_cowrin_timrous_beastie=foo_bar_baz' |
awk '{while (match($0,/(.*)_(.)(.*=.*)/,a)) $0 = a[1] toupper(a[2]) a[3]} 1'
weeSleekitCowrinTimrousBeastie=foo_bar_baz
Change _(.) to _([[:lower:]]) if you only want the underscores removed when followed by a lower case letter.
There is a string a_b_c_d. I want to replace _ with - in the part of the string between a_ and _d. Here is my attempt:
echo "a_b_c_d" | sed -E 's/(.+)_(.+)_(.+)/\1`s/_/-/g \2`\3/g'
But it does not work. How can I reuse \2 and replace its content?
Perl allows code in the replacement section when you use the e modifier
$ echo 'a_b_c_d' | perl -pe 's/a_\K.*(?=_d)/$&=~tr|_|-|r/e'
a_b-c_d
$ echo 'x_a_b_c_y' | perl -pe 's/x_\K.*(?=_y)/$&=~tr|_|-|r/e'
x_a-b-c_y
$&=~tr|_|-|r here $& is the matched portion, and tr is applied to it to replace _ with -
a_\K this will match a_ but won't be part of matched portion
(?=_d) positive lookahead to match _d but won't be part of matched portion
With sed (tested on GNU sed 4.2.2, not sure of syntax for other versions)
$ echo 'a_b_c_d' | sed -E ':a s/(a_.*)_(.*_d)/\1-\2/; ta'
a_b-c_d
$ echo 'x_a_b_c_y' | sed -E ':a s/(x_.*)_(.*_y)/\1-\2/; ta'
x_a-b-c_y
:a label a
s/(a_.*)_(.*_d)/\1-\2/ substitute one _ with - between a_ and _d
ta go to label a as long as the substitution succeeds
gnu sed:
$ sed -r 's/_/-/g;s/(^[^-]+)-/\1_/;s/-([^-]+$)/_\1/' <<<'x_a_b_c_y'
x_a-b-c_y
The idea is, replacing all _ by -, then restoring the ones you want to keep.
update
If the fields separated by _ themselves contain -, we can make use of the g and e flags of GNU sed:
sed -r 's/(^[^_]+_)(.*)(_[^_]+$)/echo "\1"$(echo "\2"\|sed "s|_|-|g")"\3"/ge'
For example we want ----_f-o-o_b-a-r_---- to be ----_f-o-o-b-a-r_----:
sed -r 's/(^[^_]+_)(.*)(_[^_]+$)/echo "\1"$(echo "\2"\|sed "s|_|-|g")"\3"/ge' <<<'----_f-o-o_b-a-r_----'
----_f-o-o-b-a-r_----
Following Kent's suggestion, and if you do not need a general solution, this works:
$ echo 'a_b_c+d_x' | tr '_' '-' | sed -E 's/^([a-z]+)-(.+)-([a-z]+)$/\1_\2_\3/g'
a_b-c+d_x
The character classes should be adjusted to match the leading and trailing parts of your input string. Fails, of course, if a or x contain the '-' character.
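If the outer parts may themselves contain -, one way around that limitation is to cut the string apart with POSIX parameter expansion instead of a regex and to run tr on the middle part only. This is a different technique from the answers above, offered as a sketch; the function name dash_inner is made up:

```shell
# Hypothetical helper: turn every "_" into "-" except the first and the last.
dash_inner() {
  s=$1
  case $s in
    *_*_*) ;;                          # at least two underscores: go on
    *) printf '%s\n' "$s"; return ;;   # nothing lies between two underscores
  esac
  head=${s%%_*}    # text before the first underscore
  tail=${s##*_}    # text after the last underscore
  mid=${s#*_}      # drop the head and the first underscore ...
  mid=${mid%_*}    # ... then the last underscore and the tail
  printf '%s_%s_%s\n' "$head" "$(printf '%s' "$mid" | tr '_' '-')" "$tail"
}

dash_inner 'a_b_c+d_x'
# prints a_b-c+d_x
dash_inner '----_f-o-o_b-a-r_----'
# prints ----_f-o-o-b-a-r_----
```

Because no regex is involved, the head and tail can hold any characters other than an underscore.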
I would like to convert strings like:
abc=123.24|127.9|2891;xyz;hgy
to:
abc=123.24,127.9,2891;xyz;hgy
This is close:
echo "abc=123.24|127.9|2891;xyz;hgy" | sed -r 's/(=)([0-9.]+)\|/\1\2,/g'
but returns:
abc=123.24,127.9|2891;xyz;hgy
How can I do the rest of the numbers in a similar fashion if the number of bar-separated numbers is variable?
Clarification:
I hate it when people do not give me the whole picture in questions, but my original description above did just that. The small example is embedded in a much larger line that includes "|" separated text. I want to replace only the "|" with "," between numbers that follow an equal sign. Here is an entire line as an example:
chr1 69511 rs75062661 A G . QSS_ref ASP;BaseCounts=375,3,118,4;CAF=[0.348,0.652];COMMON=1;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|1|1);GNO;HRun=0;HaplotypeScore=0.0000;KGPROD;KGPhase1;LowMQ=0.0280,0.0580,500;MQ=49.32;MQ0=14;MSigDb=ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN,KEGG_OLFACTORY_TRANSDUCTION,REACTOME_GPCR_DOWNSTREAM_SIGNALING,REACTOME_OLFACTORY_SIGNALING_PATHWAY,REACTOME_SIGNALING_BY_GPCR,chr1p36;NORMALT=86;NORMREF=228;NSM;NT=het;OTHERKG;QSS=8;QSS_NT=6;REF;RS=75062661;RSPOS=69511;S3D;SAO=0;SGT=AG->AG;SOMATIC;SSR=0;TQSS=1;TQSS_NT=2;TUMALT=15;TUMREF=227;TUMVAF=0.06198347107438017;TUMVARFRACTION=0.1485148514851485;VC=SNV;VLD;VP=0x050200000a05140116000100;WGT=1;dbNSFP_1000Gp1_AC=1424;dbNSFP_1000Gp1_AF=0.652014652014652;dbNSFP_1000Gp1_AFR_AC=162;dbNSFP_1000Gp1_AFR_AF=0.32926829268292684;dbNSFP_1000Gp1_AMR_AC=235;dbNSFP_1000Gp1_AMR_AF=0.649171270718232;dbNSFP_1000Gp1_ASN_AC=500;dbNSFP_1000Gp1_ASN_AF=0.8741258741258742;dbNSFP_1000Gp1_EUR_AC=527;dbNSFP_1000Gp1_EUR_AF=0.6952506596306068;dbNSFP_29way_logOdds=4.1978;dbNSFP_29way_pi=0.1516:0.0:0.6258:0.2226;dbNSFP_ESP6500_AA_AF=0.544101;dbNSFP_ESP6500_EA_AF=0.887429;dbNSFP_Ensembl_geneid=ENSG00000186092;dbNSFP_Ensembl_transcriptid=ENST00000534990|ENST00000335137;dbNSFP_FATHMM_score=0.51;dbNSFP_GERP++_NR=2.31;dbNSFP_GERP++_RS=1.15;dbNSFP_Interpro_domain=GPCR|_rhodopsin-like_superfamily_(1)|;dbNSFP_LRT_Omega=0.000000;dbNSFP_LRT_pred=N;dbNSFP_LRT_score=0.000427;dbNSFP_MutationAssessor_pred=neutral;dbNSFP_MutationAssessor_score=-1.295;dbNSFP_MutationTaster_pred=N;dbNSFP_MutationTaster_score=0.000162;dbNSFP_Polyphen2_HDIV_pred=B;dbNSFP_Polyphen2_HVAR_pred=B;dbNSFP_SIFT_score=0.950000;dbNSFP_Uniprot_aapos=141;dbNSFP_Uniprot_acc=Q8NH21;dbNSFP_Uniprot_id=OR4F5_HUMAN;dbNSFP_aaalt=A;dbNSFP_aapos=189|141;dbNSFP_aaref=T;dbNSFP_cds_strand=+;dbNSFP_codonpos=1;dbNSFP_fold-degenerate=0;dbNSFP_phyloP=0.267000;dbNSFP_refcodon=ACA;dbSNPBuildID=131 AU:CU:DP:FDP:GU:SDP:SUBDP:TU 
228,232:3,3:322:4:86,109:0:0:1,2 227,228:0,0:244:1:15,16:0:0:1,2
The replacement in this line is of the string:
dbNSFP_aapos=189|141
with:
dbNSFP_aapos=189,141
why not:
sed 's/|/,/g'
kent$ echo "abc=123.24|127.9|2891;xyz;hgy"|sed 's/|/,/g'
abc=123.24,127.9,2891;xyz;hgy
Without using perl you can use
input="abc=123;def=hello123test;dbNSFP_aapos=189|141|142;dbNSFP_aaref=T;another=test|hello;"
sed -r 's/(.*=)([0-9.]+\|)+(.*)/\1'$(sed -r 's/(.*=)([0-9.]+\|)+(.*)/\2/' <<< $input | tr '|' ,)'\3/' <<< $input
Output:
abc=123;def=hello123test;dbNSFP_aapos=189,141,142;dbNSFP_aaref=T;another=test|hello;
Replace <<< $input with your file or whatever you actually have as input :)
Explanation:
We have three capturing groups in the regex (I restructured the groups from the OP), the second will contain only the string where the replacement of the | is to happen, while the first and third contain everything before and after the second group.
See the demo at regex101.
Within the second command ($(...)) we grab the second capturing group with sed and replace every | inside with a comma. This substitution is then used to be put in the place of the second group within the other sed-call.
As an alternative, you can try perl and its evaluation flag:
echo "..." | perl -pe 's{=([\d.|]+)}{"=" . (join ",", split /\|/, $1)}eg'
It searches for a string of digits, dots and bars after an equal sign, splits it on | and joins the pieces with commas.
Using tr
echo "abc=123.24|127.9|2891;xyz;hgy" | tr \| ,
abc=123.24,127.9,2891;xyz;hgy
Assuming semicolon is your field separator, how about something like
tr ';\n' '\n;' | sed '/=[0-9.|]*$/s/|/,/g' | tr '\n;' ';\n'
This has a serious flaw; it fails in weird ways for the first and last fields on a line. If you can't live with that, maybe try this:
awk -F ';' '{ for(i=1; i<=NF; ++i) if ($i ~ /=([0-9.]+\|)+[0-9.]+$/) gsub(/\|/,",",$i); print }'
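For the stated goal (replace | only between numbers that follow an equal sign), a single sed loop is another option. The sketch below is an illustration rather than a tested solution for every annotation format; the function name commas_in_numeric_lists is made up, and it assumes the numbers consist only of digits and dots:

```shell
# Hypothetical helper; assumes unsigned numbers made of digits and dots.
commas_in_numeric_lists() {
  # Group 1: "=" plus the numbers already joined by commas.
  # Group 3: the first character of the next number after a "|".
  # Each pass converts one "|"; "ta" repeats until none is left.
  sed -E -e ':a' -e 's/(=[0-9.]+(,[0-9.]+)*)\|([0-9.])/\1,\3/' -e 'ta' "$@"
}

printf '%s\n' 'abc=123.24|127.9|2891;xyz;hgy' | commas_in_numeric_lists
# prints abc=123.24,127.9,2891;xyz;hgy
```

Bars inside text values such as another=test|hello are left alone, because the match must start with = followed directly by a number.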
I've got a CSV file that looks like:
1,3,"3,5",4,"5,5"
Now I want to change all the "," not within quotes to ";" with sed, so it looks like this:
1;3;"3,5";4;"5,5"
But I can't find a pattern that works.
If you are expecting only numbers then the following expression will work
sed -e 's/,/;/g' -e 's/\("[0-9][0-9]*\);\([0-9][0-9]*"\)/\1,\2/g'
e.g.
$ echo '1,3,"3,5",4,"5,5"' | sed -e 's/,/;/g' -e 's/\("[0-9][0-9]*\);\([0-9][0-9]*"\)/\1,\2/g'
1;3;"3,5";4;"5,5"
You can't just replace the [0-9][0-9]* with .* to retain any , that is delimited by quotes: .* is too greedy and matches too much. So you have to use [a-z0-9]*
$ echo '1,3,"3,5",4,"5,5",",6","4,",7,"a,b",c' | sed -e 's/,/;/g' -e 's/\("[a-z0-9]*\);\([a-z0-9]*"\)/\1,\2/g'
1;3;"3,5";4;"5,5";",6";"4,";7;"a,b";c
It also has the advantage over the first solution of being simple to understand. We just replace every , by ; and then correct every ; in quotes back to a ,
You could try something like this:
echo '1,3,"3,5",4,"5,5"' | sed -r 's|("[^"]*),([^"]*")|\1\x1\2|g;s|,|;|g;s|\x1|,|g'
which replaces all commas within quotes with the \x1 char, then replaces all remaining commas with semicolons, and then replaces the \x1 chars back with commas. This should work as long as the file is correctly formed, contains no \x1 chars to begin with, and never has a double quote inside double quotes, like "a\"b".
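A placeholder character can be avoided altogether by letting awk split each line on the double quote, so that text outside quotes lands in the odd-numbered fields. This is a sketch under the same assumptions (balanced quotes, no escaped quotes inside a value); the function name commas_to_semicolons is made up:

```shell
# Hypothetical helper; assumes balanced quotes and no escaped quotes in values.
commas_to_semicolons() {
  awk 'BEGIN { FS = OFS = "\"" }
  {
    # Odd fields hold the text outside quotes, even fields the quoted values.
    for (i = 1; i <= NF; i += 2)
      gsub(/,/, ";", $i)
    print
  }' "$@"
}

printf '%s\n' '1,3,"3,5",4,"5,5"' | commas_to_semicolons
# prints 1;3;"3,5";4;"5,5"
```

Since the output separator is the quote itself, the quotes around each value are preserved.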
Using gawk
gawk '{$1=$1}1' FPAT="([^,]+)|(\"[^\"]+\")" OFS=';' filename
Test:
[jaypal:~/Temp] cat filename
1,3,"3,5",4,"5,5"
[jaypal:~/Temp] gawk '{$1=$1}1' FPAT='([^,]+)|(\"[^\"]+\")' OFS=';' filename
1;3;"3,5";4;"5,5"
This might work for you:
echo '1,3,"3,5",4,"5,5"' |
sed 's/\("[^",]*\),\([^"]*"\)/\1\n\2/g;y/,/;/;s/\n/,/g'
1;3;"3,5";4;"5,5"
Here's an alternative solution, which is longer but more flexible:
echo '1,3,"3,5",4,"5,5"' |
sed 's/^/\n/;:a;s/\n\([^,"]\|"[^"]*"\)/\1\n/;ta;s/\n,/;\n/;ta;s/\n//'
1;3;"3,5";4;"5,5"