sed: replace whitespace characters with a single comma except inside quotes - regex

This line is from a car dataset (https://archive.ics.uci.edu/ml/datasets/Auto+MPG)
looking like this:
15.0 8. 429.0 198.0 4341. 10.0 70. 1. "ford galaxie 500"
How would one replace each run of whitespace (the file has both spaces and tabs) with a single comma, but not inside the quotes, preferably using sed, to turn the dataset into a real CSV? Thanks!

Do it with awk:
awk -F'"' 'BEGIN { OFS="\"" } { for(i = 1; i <= NF; i += 2) { gsub(/[ \t]+/, ",", $i); } print }' filename.csv
With " as the field separator, the odd-numbered fields are the parts of the line outside the quotes, which is where whitespace should be replaced. Annotated:
BEGIN { OFS = FS }                   # output should also be separated by "
{
    for (i = 1; i <= NF; i += 2) {   # in every second field
        gsub(/[ \t]+/, ",", $i)      # replace each run of spaces/tabs with a comma
    }
    print                            # and print the whole shebang
}
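As a quick check of the approach above, here is a minimal sketch that wraps the same awk program in a shell function and feeds it one sample line with mixed spaces and tabs. The function name quote_aware_csv and the exact spacing of the sample line are assumptions made for illustration; they are not part of the dataset or of the original answer.

```shell
# quote_aware_csv: hypothetical helper name wrapping the awk program above.
quote_aware_csv() {
    awk -F'"' 'BEGIN { OFS = FS }
    {
        # Odd-numbered fields lie outside the quotes.
        for (i = 1; i <= NF; i += 2) gsub(/[ \t]+/, ",", $i)
        print
    }' "$@"
}

# Sample line with runs of spaces and tabs (spacing is made up).
printf '15.0   8.   429.0\t198.0   4341.   10.0   70.  1.\t"ford galaxie 500"\n' | quote_aware_csv
# prints: 15.0,8.,429.0,198.0,4341.,10.0,70.,1.,"ford galaxie 500"
```

The quoted car name keeps its inner spaces because it sits in an even-numbered field, which the loop never touches.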

This might work for you (GNU sed):
sed 's/\("[^"]*"\|[0-9.]*\)\s\s*/\1,/g' file
This takes a quoted string or a decimal number followed by white space and replaces the white space by a comma - throughout each and every line.
To be less specific use (as per comments):
sed -r 's/("[^"]*"|\S+)\s+/\1,/g' file

Related

Removing multiple delimiters between outside delimiters on each line

Using awk or sed in a bash script, I need to remove the comma delimiters that are located between an inner and an outer delimiter. The problem is that values end up in the wrong columns, where only 3 columns are desired.
For example, I want to turn this:
2020/11/04,Test Account,569.00
2020/11/05,Test,Account,250.00
2020/11/05,More,Test,Accounts,225.00
Into this:
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
I've tried a few things while testing regexes, but I cannot find a way to select only the commas that need to be removed.
awk -F, '{ printf "%s,",$1;for (i=2;i<=NF-2;i++) { printf "%s ",$i };printf "%s,%s\n",$(NF-1),$NF }' file
Using awk, print the first comma-delimited field, then loop over the following fields up to the third-from-last, printing each one followed by a space. Finally, print the second-to-last field, a comma, and the last field.
With GNU awk for the 3rd arg to match():
$ awk -v OFS=, '{
match($0,/([^,]*),(.*),([^,]*)/,a)
gsub(/,/," ",a[2])
print a[1], a[2], a[3]
}' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
or with any awk:
$ awk '
BEGIN { FS=OFS="," }
{
n = split($0,a)
gsub(/^[^,]*,|,[^,]*$/,"")
gsub(/,/," ")
print a[1], $0, a[n]
}
' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
Use this Perl one-liner:
perl -F',' -lane 'print join ",", $F[0], "@F[1 .. ($#F-1)]", $F[-1];' in.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
$F[0] : first element of the array @F (= first comma-delimited value).
$F[-1] : last element of @F.
@F[1 .. ($#F-1)] : elements of @F between the second from the start and the second from the end, inclusive.
"@F[1 .. ($#F-1)]" : the above elements, joined on blanks into a string.
join ",", ... : join the LIST "..." on a comma, and return the resulting string.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perl -pe 's{,\K.*(?=,)}{$& =~ y/,/ /r}e' file
sed -e ':a' -e 's/\(,[^,]*\),\([^,]*,\)/\1 \2/; t a' file
awk '{$1=$1","; $NF=","$NF; gsub(/ *, */,","); print}' FS=, file
awk '{for (i=2; i<=NF; ++i) $i=(i>2 && i<NF ? " " : ",") $i} 1' FS=, OFS= file
awk doesn't support lookarounds, but its match() function can stand in for them. Try the following, written and tested with the shown samples in GNU awk.
awk '
match($0,/,.*,/){
val=substr($0,RSTART+1,RLENGTH-2)
gsub(/,/," ",val)
print substr($0,1,RSTART) val substr($0,RSTART+RLENGTH-1)
}
' Input_file
Yet another perl
$ perl -pe 's/(?:^[^,]*,|,[^,]*$)(*SKIP)(*F)|,/ /g' ip.txt
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
(?:^[^,]*,|,[^,]*$) matches first/last field along with the comma character
(*SKIP)(*F) makes the regex engine skip whatever the preceding alternative matched, so the first and last fields are left unmodified
|, provide , as alternate regexp to be matched for modification
With sed (assuming \n is supported by the implementation, otherwise, you'll have to find a character that cannot be present in the input)
sed -E 's/,/\n/; s/,([^,]*)$/\n\1/; y/,/ /; y/\n/,/'
s/,/\n/; s/,([^,]*)$/\n\1/ replace first and last comma with newline character
y/,/ / replace all comma with space
y/\n/,/ change newlines back to comma
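If the sed at hand does not accept \n in the replacement, the same steps can be sketched with an ordinary placeholder character, as the answer suggests. This is a minimal sketch that assumes | never occurs in the input; the function name merge_middle is hypothetical.

```shell
# merge_middle: hypothetical helper; assumes "|" never appears in the data.
# Step 1: turn the first comma into the placeholder.
# Step 2: turn the last comma into the placeholder.
# Step 3: turn every remaining comma into a space.
# Step 4: turn the placeholders back into commas.
merge_middle() {
    sed -e 's/,/|/' \
        -e 's/,\([^,]*\)$/|\1/' \
        -e 'y/,/ /' \
        -e 'y/|/,/' "$@"
}

printf '2020/11/05,More,Test,Accounts,225.00\n' | merge_middle
# prints: 2020/11/05,More Test Accounts,225.00
```

Only basic regular expressions and the y command are used, so no -E or GNU extension is needed.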
A similar answer to Timur's, in awk
awk '
BEGIN { FS = OFS = "," }
function join(start, stop, sep, str, i) {
str = $start
for (i = start + 1; i <= stop; i++) {
str = str sep $i
}
return str
}
{ print $1, join(2, NF-1, " "), $NF }
' file.csv
It's a shame awk doesn't ship with a join function builtin

How to use sed to extract numbers from a comma separated string?

I managed to extract the following response and comma-separate it. It's a comma-separated string and I'm only interested in the comma-separated values of the account IDs. How do you pattern match using sed?
Input: ACCOUNT_ID,711111111119,ENVIRONMENT,dev,ACCOUNT_ID,111111111115,dev
Expected Output: 711111111119, 111111111115
My $input variable stores the input
I tried the below but it joins all the numbers and I would like to preserve the comma ','
echo $input | sed -e "s/[^0-9]//g"
I think you're better served with awk:
awk -v FS=, '{for(i=1;i<=NF;i++)if($i~/[0-9]/){printf "%s%s",sep,$i;sep=","}}'
If you really want sed, you can go for
sed -e "s/[^0-9]/,/g" -e "s/,,*/,/g" -e "s/^,\|,$//g"
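The sed command above relies on \| inside a basic regular expression, which is a GNU extension. A minimal sketch of the same idea without it: collapse every run of non-digits into one comma, then trim a leading and a trailing comma. The function name extract_ids is hypothetical, and like the original it assumes that digits appear only in the account IDs.

```shell
# extract_ids: hypothetical helper; assumes only the account IDs contain digits.
extract_ids() {
    sed -e 's/[^0-9][^0-9]*/,/g' -e 's/^,//' -e 's/,$//'
}

input='ACCOUNT_ID,711111111119,ENVIRONMENT,dev,ACCOUNT_ID,111111111115,dev'
printf '%s\n' "$input" | extract_ids
# prints: 711111111119,111111111115
```

A field such as dev2 would leak its digit into the output, which is why matching on the ACCOUNT_ID label is the more robust choice.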
$ awk '
BEGIN {
FS = OFS = ","
}
{
c = 0
for (i = 1; i <= NF; i++) {
if ($i == "ACCOUNT_ID") {
printf "%s%s", (c++ ? OFS : ""), $(i + 1)
}
}
print ""
}' file
711111111119,111111111115

Remove string and add sequential number to file headers using awk or sed

I have following input:
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321]
VRKAERDSPCKRRGADRSFP
KSARLISSKAFRDVFAESITNSDPFFVVR
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA*
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321]
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL*
And I would like to
1. remove the linebreaks in lines after the header starting with >
2. remove the asterisk
3. change the fasta header
I could do 1. and 2.
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }'
sed "s/\*//g"
and I can also add a sequential number to the end of the header line:
awk '/^>/{$0=$0"_"(++i)}1'
but I am failing at the last step with the replacing/removing and adding a sequential number:
desired output
>TM0001|hypothetical_protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein_of_unknown_function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease_P_protein_component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
According to your "desired" output - gawk solution:
awk 'BEGIN{ RS=">"; FS="[|\\]\\[]" }!$0{ next }
{ gsub(/^ */,"",$3); gsub(/[*[:space:]]/,"",$5); printf(">TM%04d|%s\n%s\n",++c,$3,$5)
}' yourfile
The output:
>TM0001|hypothetical protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein of unknown function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease P protein component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
Details:
RS=">" - considering > as record separator
FS="[|\\]\\[]" - field separator, any of characters |[]
!$0{ next } - skip empty records
gsub(/^ */,"",$3) - remove leading spaces in the 3rd field
gsub(/[*[:space:]]/,"",$5) - replace/remove asterisk * and whitespace characters within the 5th field
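The output above keeps the spaces in the description, while the desired output in the question uses underscores. Here is a minimal line-by-line sketch that produces the underscore form; it avoids a custom RS and should run in any POSIX awk. The function name fasta_rename is hypothetical, and the sketch assumes every header has the shape name|ID:...| description [organism].

```shell
# fasta_rename: hypothetical helper; assumes headers look like
#   >name|ID:...| description [organism]
fasta_rename() {
    awk '
    /^>/ {
        if (seq != "") print seq        # flush the previous sequence
        seq = ""
        split($0, part, "[|]")          # part[3] is " description [organism]"
        desc = part[3]
        sub(/ *\[.*$/, "", desc)        # drop the trailing [organism] tag
        sub(/^ */, "", desc)            # drop leading spaces
        gsub(/ /, "_", desc)            # spaces become underscores
        printf ">TM%04d|%s\n", ++n, desc
        next
    }
    { gsub(/\*/, ""); seq = seq $0 }    # join sequence lines, drop asterisks
    END { if (seq != "") print seq }
    ' "$@"
}
```

Called as fasta_rename yourfile, it numbers the headers in the order they appear, so the second record becomes TM0002 regardless of its original name.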

Removing Extra commas from Comma delimited file

I have a comma delimited file with 12 columns.
There is a problem with the 5th and 6th columns: the text in both is identical, but each may contain extra commas.
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
So in the above example "Hey There,How are you" should not have a comma.
I need to remove extra commas in 5th and 6th column.
If you always want to remove the 5th comma, try
sed 's/,//5' input.txt
But you say it may have extra commas. You have to provide some logic for deciding whether extra commas are present or not.
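One way to start on that logic is to report which lines deviate from the expected column count. A minimal sketch, assuming the 12 columns stated in the question; the function name report_extra_commas is hypothetical.

```shell
# report_extra_commas: hypothetical helper; 12 is the column count from the question.
report_extra_commas() {
    awk -F, -v want=12 'NF != want { printf "%d: %d fields\n", NR, NF }' "$@"
}

printf 'a,b,c,d,e,f,g,h,i,j,k,l\n1,2,3,4,5,6,7,8,9,10,11,12,13,14\n' | report_extra_commas
# prints: 2: 14 fields
```

Each reported line has (NF - 12) surplus commas in total, to be split between the 5th and 6th columns.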
If you know the number of commas, you can use awk. This has proven to be quite an exercise, I am sure someone else will come up with a more elegant solution, but I'll share mine anyway:
awk -f script.awk input.txt
with script.awk:
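BEGIN {
    FS = ","
}
NF <= 12 {
    print $0
}
NF > 12 {
    extra = (NF - 12) / 2        # extra commas in each of columns 5 and 6
    for (i = 1; i <= 4; i++) printf "%s%s", $i, FS
    for (j = 0; j < 2; j++) {    # columns 5 and 6 hold identical text
        for (i = 0; i <= extra; i++) {
            printf "%s", $(i + 5)
            if (i < extra) printf "_"
            else printf "%s", FS
        }
    }
    # remaining six fields; newline instead of a trailing separator
    for (i = NF - 5; i <= NF; i++) printf "%s%s", $i, (i < NF ? FS : "\n")
}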
First we set the field separator to a comma. If a line has 12 or fewer fields, everything is fine and we simply print the whole line. If there are more than 12 fields, we first print the first 4 fields (each followed by the field separator), then print field 5 twice (once for column 5 and once for column 6), replacing the commas inside it with _. Finally we print the remaining fields.
As I said, there is probably a more elegant solution to this. I wonder what other people will come up with.
If all other fields are numeric, you can try to preserve the meaningful commas using that criterion.
sed -r 's/(,)[0-9]/;/g' a | sed -r 's/[0-9](,)/;/g' | sed -r 's/,//g' | awk -F\; '{ print $1 "," $2 "," $3 "," $4 "," substr($5, 0, length($5)/2) "," substr($5, length($5)/2 +1, length($5)/2) "," $6 "," $7}'
2011,23456,234567,234567,Hey ThereHow are you,Hey ThereHow are you,8286430903,
You can try with perl and its Text::CSV_XS module:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::CSV_XS;
my (@columns);
open my $fh, '<', shift or die;
my $csv = Text::CSV_XS->new or die;
while ( my $row = $csv->getline( $fh ) ) {
    undef @columns;
    if ( @$row <= 12 ) {
        @columns = @$row;
        next;
    }
    # Number of extra commas in each of columns 5 and 6.
    my $extra_columns = ( @$row - 12 ) / 2;
    # Columns 5 and 6 each span $extra_columns + 1 raw fields.
    my $post_columns_index = 4 + 2 * ( $extra_columns + 1 );
    @columns = (
        @$row[0..3],
        (join( '', @$row[4..(4+$extra_columns)] )) x 2,
        @$row[$post_columns_index..$#$row]
    );
}
continue {
    $csv->print( \*STDOUT, \@columns );
    printf "\n";
}
Assuming an input file (infile) with three lines, where the first one has an additional comma, the second one has two additional commas and the third one is correct:
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There,How are you,now,Hey There,How are you,now,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There:How are you,Hey There:How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
Run the script like:
perl script.pl infile
That yields:
2011,123456,1234567,12345678,"Hey ThereHow are you","Hey ThereHow are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey ThereHow are younow","Hey ThereHow are younow",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey There:How are you","Hey There:How are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
Note that it adds some quotes, but that is correct according to the CSV specification and easier to handle than the previous state.

Find specific words and replace with capitals

I need to use Unix and create an awk script. The first part of the script is to find the words "Ant" "Ass" and "Ape" in a text file and replace them with the same word but capitalized.
Do I use gsub to find each occurrence? If I do:
{gsub(/Ass/, "ASS"); print}
{gsub(/Ape/, "APE"); print}
{gsub(/Ant/, "ANT"); print}
it just prints every line of the file 3 or 4 times... how can I search and replace these three words and then print out only the modified line?
The second part of the program is to track the number of lines with matches to Ass, Ape, or Ant and the number of substitutions made.
Thanks for your help!
Do all the substitutions in a single clause:
{subs += gsub(/Ass/, "ASS"); subs += gsub(/Ape/, "APE"); subs += gsub(/Ant/, "ANT"); print; }
END { print "Total substitutions:", subs; }
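The question also asks for the number of lines that contained a match. A minimal sketch of one way to track both counters follows; the function name capitalize_words and the wording of the two summary lines are assumptions.

```shell
# capitalize_words: hypothetical helper that capitalizes the three words
# and reports how many lines matched and how many substitutions were made.
capitalize_words() {
    awk '
    {
        # gsub returns the number of replacements made on this line.
        n = gsub(/Ant/, "ANT") + gsub(/Ass/, "ASS") + gsub(/Ape/, "APE")
        if (n > 0) { lines++; subs += n }
        print
    }
    END {
        printf "Lines with matches: %d\n", lines
        printf "Total substitutions: %d\n", subs
    }' "$@"
}

printf 'Ant and Ape\nnothing\nAss Ass\n' | capitalize_words
```

For the three sample lines this reports 2 lines with matches and 4 substitutions. To print only the modified lines, move print inside the if block.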
sed 's/Ant/ANT/g; s/Ass/ASS/g; s/Ape/APE/g'
Another way:
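awk '
BEGIN { IGNORECASE = 1 }    # GNU awk only: match case-insensitively
{
    s = 1                   # position in $0 where the next search starts
    while (match(substr($0, s), /ass|ape|ant/) > 0) {
        p = s + RSTART - 1  # absolute position of the match in $0
        $0 = substr($0, 1, p - 1) toupper(substr($0, p, RLENGTH)) substr($0, p + RLENGTH)
        s = p + RLENGTH     # continue after the replaced word
    }
    print
}' input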