How to remove identical columns in a csv file using Bash - regex

There are already a lot of questions like this, but none of them helped me. I want to keep this simple:
I have a file (more than 90 columns) like:
Class,Gene,col3,Class,Gene,col6,Class
A,FF,23,A,FF,16,A
B,GG,45,B,GG,808,B
C,BB,43,C,BB,76,C
I want to keep unique columns so the desired output should be:
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76
I used awk '!a[$0]++' but it did not remove the repeated columns of the file.
As a side note: I have repeated columns because I used the paste command to join different files column-wise.
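For context on why the attempted one-liner can't work here: `!a[$0]++` keys on the entire line ($0), so it filters out duplicate rows, not duplicate columns. A minimal sketch:

```shell
# '!a[$0]++' prints a line only the first time that exact line is seen.
# It deduplicates rows; every row of the CSV above is unique, so it
# passes all of them through unchanged.
printf 'x,1\nx,1\ny,2\n' | awk '!a[$0]++'
# prints:
# x,1
# y,2
```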

You may use this awk to print unique columns based on their names in the first header row:
awk 'BEGIN {
    FS = OFS = ","            # set input/output field separators to comma
}
NR == 1 {                     # for the first (header) row
    for (i=1; i<=NF; i++)     # loop through all columns
        if (!ucol[$i]++)      # if the column name has not been seen before
            hdr[i]            # store the column number as a key in array hdr
}
{
    for (i=1; i<=NF; i++)     # loop through all columns
        if (i in hdr)         # if the column number is found in hdr
            printf "%s", (i==1 ? "" : OFS) $i   # print the column with OFS
    print ""                  # print the line break
}' file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

For your specific case, where you're just trying to remove the 2 columns added by paste per original file, all you need is:
$ awk '
BEGIN { FS=OFS="," }
{ r=$1 OFS $2; for (i=3; i<=NF; i+=3) r=r OFS $i; print r }
' file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76
but in other situations it's not as simple. For the general case: create an array (f[] below) that maps output field numbers to input field numbers, where the output fields are determined by the uniqueness of the field/column names on the first line. Then loop through just the output field numbers (note: you don't have to loop through all of the input fields, only the ones you're going to output), printing the value of the corresponding input field:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==1 {
    for (i=1; i<=NF; i++) {
        if ( !seen[$i]++ ) {
            f[++nf] = i
        }
    }
}
{
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}
$ awk -f tst.awk file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76
Here's a version with more meaningful variable names and a couple of intermediate variables to clarify what's going on:
BEGIN { FS=OFS="," }
NR==1 {
    numInFlds = NF
    for (inFldNr=1; inFldNr<=numInFlds; inFldNr++) {
        fldName = $inFldNr
        if ( !seen[fldName]++ ) {
            out2in[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2in[outFldNr]
        fldValue = $inFldNr
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}

Print the first two columns and then iterate in strides of 3 to skip the Class and Gene columns in the rest of the row.
awk -F, '{printf("%s,%s", $1, $2); for (i=3; i<=NF; i+=3) printf(",%s", $i); printf("\n")}'
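For a quick check, here is the same one-liner applied to the sample from the question (inlined via printf rather than a file):

```shell
# Print columns 1 and 2, then every third column starting at 3,
# which skips the repeated Class/Gene columns introduced by paste.
printf 'Class,Gene,col3,Class,Gene,col6,Class\nA,FF,23,A,FF,16,A\n' |
awk -F, '{printf("%s,%s", $1, $2); for (i=3; i<=NF; i+=3) printf(",%s", $i); printf("\n")}'
# prints:
# Class,Gene,col3,col6
# A,FF,23,16
```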

Related

File fields and columns adjustment with awk [LINUX]

I have an issue adjusting the column delimiters in a file on Linux before loading it into a database.
I need 14 columns and I use "|" as the delimiter, so I applied:
awk -F'|' '{missing=14-NF;if(missing==0){print $0}else{printf "%s",$0;for(i=1;i<=missing-1;i++){printf "|"};print "|"}}' myFile
Suppose I have a row like that:
a|b|c|d|e||f||g||||h|i|
after applying the awk command it will be:
a|b|c|d|e||f||g||||h|i||
and this is not acceptable I need the data to be 14 columns only.
Sample input (in the case of a 14-field row):
a|b|c|d|e||f||g||||h|i
Do nothing.
Sample input (in the case of extra fields):
a|b|c|d|e||f||g||||h|i|
Output:
a|b|c|d|e||f||g||||h|i
Sample input (in the case of fewer fields):
a|b|c|d||e||f||g|h
Output:
a|b|c|d||e||f||g|h|||
You may use this gnu-awk solution:
awk -v n=14 '
BEGIN { FS=OFS="|" }
{
    $0 = gensub(/^(([^|]*\|){13}[^|]*)\|.*/, "\\1", "1")
    for (i=NF+1; i<=n; ++i)
        $i = ""
} 1' file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i
a|b|c|d||e||f||g|h|||
Where the original file is this:
cat file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i|
a|b|c|d||e||f||g|h
Here:
Using gensub we remove all extra fields
Using a for loop we create new fields so that NF = n
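The padding step can be sketched in isolation with plain (POSIX) awk, using n=5 purely for illustration:

```shell
# Assigning to a field past NF extends the record and rebuilds $0
# with OFS between the fields, so a 2-field row grows to 5 fields.
printf 'a|b\n' | awk -v n=5 'BEGIN{FS=OFS="|"} {for(i=NF+1;i<=n;++i) $i=""} 1'
# prints: a|b|||
```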
If you don't have gnu-awk then the following should work in a non-GNU awk (tested on BSD awk):
awk -v n=14 '
BEGIN { FS=OFS="|" }
{
    for (i=NF+1; i<=n; ++i) $i = ""   # pad short rows out to n fields
    for (i=n+1; i<=NF; ++i) $i = ""   # blank any extra fields
    NF = n                            # truncate the record to n fields
} 1' file

How do I detect embedded field names and reorder fields using awk?

I have the following data:
"b":1.14105,"a":1.14106,"x":48,"t":1594771200000
"a":1.141,"b":1.14099,"x":48,"t":1594771206000
...
I am trying to display the data in a given order and only for three fields. As the field order is not guaranteed, I need to read the "tag" of each comma-separated column on each line.
I have tried to solve this task using awk:
awk -F',' '
{
for(i=1; i<=$NF; i++) {
if(index($i,"\"a\":")!=0) a=$i;
if(index($i,"\"b\":")!=0) b=$i;
if(index($i,"\"t\":")!=0) t=$i;
}
printf("%s,%s,%s\n",a,b,t);
}
'
But I get:
,,
,,
...
In the above data sample, I would expect:
"a":1.14106,"b":1.14105,"t":1594771200000
"a":1.141,"b":1.14099,"t":1594771206000
...
Note: I am using the awk shipped with FreeBSD
$ cat tst.awk
BEGIN {
    FS = "[,:]"
    OFS = ","
}
{
    for (i=1; i<NF; i+=2) {
        f[$i] = $(i+1)
    }
    print p("a"), p("b"), p("t")
}
function p(tag, t) {
    t = "\"" tag "\""
    return t ":" f[t]
}
$ awk -f tst.awk file
"a":1.14106,"b":1.14105,"t":1594771200000
"a":1.141,"b":1.14099,"t":1594771206000
With awk and an array:
awk -F '[:,]' '{for(i=1; i<=NF; i=i+2){a[$i]=$(i+1)}; print "\"a\":" a["\"a\""] ",\"b\":" a["\"b\""] ",\"t\":" a["\"t\""]}' file
or
awk -F '[":,]' '{for(i=2; i<=NF; i=i+4){a[$i]=$(i+2)}; print "\"a\":" a["a"] ",\"b\":" a["b"] ",\"t\":" a["t"]}' file
Output:
"a":1.14106,"b":1.14105,"t":1594771200000
"a":1.141,"b":1.14099,"t":1594771206000
A similar awk where you can specify the fields and their order:
$ awk -F[:,] -v fields='"a","b","t"' 'BEGIN{n=split(fields,f)}
{for(i=1;i<NF;i+=2) map[$i]=$(i+1);
for(i=1;i<=n;i++) printf "%s", f[i]":"map[f[i]] (i==n?ORS:",")}' file
"a":1.14106,"b":1.14105,"t":1594771200000
"a":1.141,"b":1.14099,"t":1594771206000

How to have a good output in bash when printing?

I have this command here, and I have a problem achieving a good format.
For these lines,
DATE*2014*09*23
VAL*0001*ABC
N3*Sample
VAL*0002*XYZ
My desired output here is like this:
["ABC", "XYZ"]
I tried this code:
perl -nle 'print $& if /VAL\*[0-9]*\*\K.*/' file | awk '{ printf "\"%s\",", $0 }'
resulting only:
"ABC","XYZ",
Another thing is the case when only one value is printed.
If it happens that a file is like this:
DATE*2014*09*23
VAL*0001*ABC
N3*Sample
my desired output would be just this (without the [] brackets):
"ABC"
You can do it all with awk:
#!/usr/bin/awk -f
BEGIN { FS="*"; i=0; ORS="" }
$1=="VAL" { a[i++] = $3 }
END {
    if (i>1) {
        print "[\"" a[0]
        for (j = 1; j < i; j++)
            print "\",\"" a[j]
        print "\"]"
    }
    if (i==1)
        print "\"" a[0] "\""
}

awk: remove unwanted records and consolidate multiline fields into one-line records in a specific order

I have an output file that I am trying to process into a formatted csv for our audit team.
I thought I had this mastered until I stumbled across bad data within the output. As such, I want to be able to handle this using awk.
MY OUTPUT FILE EXAMPLE
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
THE OUTPUT I WANT AFTER PROCESSING
joe,bloggs,s01234565;uid=bloggsj,cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=E09876543;cn=andy-peters,ou=appserver,ou=components,o=hoster
As you can see:
we always have a cn= variable that contains o=hoster
uid can have any value
we may have multiple cn= variables without o=hoster
I have achieved the following:
cat output | awk '!/^o.*/ && !/^Enter.*/{print}' | awk '{getline a; getline b; getline c; getline d; print $0,a,b,c,d}' | awk -v srch1="cn=" -v repl1="" -v srch2="sn=" -v repl2="" '{ sub(srch1,repl1,$2); sub(srch2,repl2,$3); print $4";"$2" "$3";"$1 }'
Any pointers or guidance using awk are greatly appreciated. Or should I give up and just use the age-old, long-winded method of a large looping script to process the file?
You may try the following awk code:
$ cat file
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster
cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj
cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain
cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
Awk code:
awk '
function out() {
    print s, u, last
    i = 0; s = ""
}
/^cn/,!NF {
    ++i
    last = i == 1 ? $0 : last
    s = i>1 && !/uid/ && NF ? s ? s "," $NF : $NF : s
    u = /uid/ ? $0 : u
}
i && !NF {
    out()
}
END {
    out()
}
' FS="=" OFS=";" file
Resulting in:
joe,bloggs,S01234565;uid=bloggsj;cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=petersa;cn=andy-peters,ou=appserver,ou=components,o=hoster
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
This awk script works for your sample and produces the sample output:
BEGIN { delete cn[0]; OFS = ";" }
function print_info() {
    if (length(cn)) {
        names = cn[1] "," sn
        for (i=2; i <= length(cn); ++i) names = names "," cn[i]
        print names, uid, dn
        delete cn
    }
}
/^cn=/ {
    if ($0 ~ /o=hoster/) dn = $0
    else {
        cn[length(cn)+1] = substr($0, index($0, "=") + 1)
        uid = $0; sub("cn", "uid", uid)
    }
}
/^sn=/ { sn = substr($0, index($0, "=") + 1) }
/^uid=/ { uid = $0 }
/^$/ { print_info() }
END { print_info() }
This should help you get started.
awk '$1 ~ /^cn/ {
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^uid/) {
            u = $i
            continue
        }
        sub(/^[^=]*=/, x, $i)
        r = length(r) ? r OFS $i : $i
    }
    print r, u, $1
    r = u = x
}' OFS=, RS= infile
I assume that there is an error in your sample output: in the 3rd record the uid should be petersa and not E09876543.
You might want to look at some of the "already been there and done that" solutions to accomplish the task.
Apache Directory Studio for example, will do the LDAP query and save the file in CSV or XLS format.

grep by row and column using bash

Assuming the following sample csv input
SYMBOL,JAN-11 ,FEB-11 ,MAR-11
DEF ,20 ,25 ,20
HIG ,50 ,50 ,50
Is there any way to grep for a particular value using both row and column,
i.e. grep for symbol DEF and FEB-11 should return the value 25?
The row-wise grep is trivial, but I am having problems with the column-wise grep.
Any help would be greatly appreciated.
As @Ignacio Vazquez-Abrams said, awk is a much better tool for this job. Try the following script:
#!/bin/awk -f
# usage: csvgrep row column [file]
BEGIN {
    FS = "[ \t]*,[ \t]*"
    row = ARGV[1]
    col = ARGV[2]
    ARGV[1] = ARGV[2] = ""
    # read the header and translate the column name into a column number
    getline
    for (i=1; i<=NF; i++)
        if ($i == col) {
            col = i
            break
        }
}
($1 == row) { print $col }
You may want to add input validation. awk may be in /usr/bin on your system.
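For a quick check of how it's invoked, here is the same logic inlined, looking up row DEF and column FEB-11 in the sample CSV from the question (fed on stdin instead of a named file):

```shell
# Row and column name come in as awk arguments; blanking them in ARGV
# makes awk fall back to stdin. FS strips the whitespace around commas.
printf 'SYMBOL,JAN-11 ,FEB-11 ,MAR-11\nDEF ,20 ,25 ,20\nHIG ,50 ,50 ,50\n' |
awk 'BEGIN {
    FS = "[ \t]*,[ \t]*"
    row = ARGV[1]; col = ARGV[2]
    ARGV[1] = ARGV[2] = ""
    getline                          # read the header row
    for (i=1; i<=NF; i++)
        if ($i == col) { col = i; break }
}
($1 == row) { print $col }' DEF FEB-11
# prints: 25
```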