grep by row and column using bash - regex

Assuming the following sample csv input
SYMBOL,JAN-11 ,FEB-11 ,MAR-11
DEF ,20 ,25 ,20
HIG ,50 ,50 ,50
Is there any way to grep for a particular value using both row and column?
For example, a grep for symbol DEF and column FEB-11 should return the value 25.
The row-wise grep is trivial, but I am having problems with the column-wise grep.
Any help would be greatly appreciated.

As @Ignacio Vazquez-Abrams said, awk is a much better tool for this job. Try the following script:
#!/bin/awk -f
# usage: csvgrep row column [file]
BEGIN {
    # fields are separated by a comma with optional blanks on either side
    FS = "[ \t]*,[ \t]*"
    row = ARGV[1]
    col = ARGV[2]
    ARGV[1] = ARGV[2] = ""
    # read header and turn the column name into a field number
    getline
    for (i=1; i<=NF; i++)
        if ($i == col) {
            col = i
            break
        }
}
($1 == row) { print $col }
You may want to add input validation. If awk lives in /usr/bin on your system, adjust the shebang line accordingly.
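A minimal sketch of the input validation mentioned above, written as a shell function around the same awk logic. The function name `csv_lookup`, the exit codes, and the error messages are illustrative assumptions, not part of the original script. It reads the CSV from standard input or from files named after the two keys.

```shell
# csv_lookup ROW COLUMN [FILE...]   (hypothetical helper name)
csv_lookup() {
    if [ "$#" -lt 2 ]; then
        echo "usage: csv_lookup row column [file...]" >&2
        return 2
    fi
    row=$1 col=$2
    shift 2
    awk -v row="$row" -v col="$col" '
        BEGIN { FS = "[ \t]*,[ \t]*" }
        { sub(/[ \t\r]+$/, "") }        # drop trailing blanks so the last field compares cleanly
        NR == 1 {                       # header line: locate the requested column
            for (i = 1; i <= NF; i++)
                if ($i == col) { c = i; break }
            if (!c) { print "no such column: " col > "/dev/stderr"; bad = 1; exit }
            next
        }
        $1 == row { print $c; found = 1 }
        END {
            if (bad) exit 2             # assumed convention: 2 = unknown column
            if (!found) { print "no such row: " row > "/dev/stderr"; exit 1 }
        }
    ' "$@"
}
```

With the sample input from the question, `csv_lookup DEF FEB-11 file.csv` prints 25, and an unknown column name produces a non-zero exit status instead of silently printing the wrong field.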

Related

File fields and columns adjustment with awk [LINUX]

I need to adjust the column delimiters in a file on Linux before loading it into a database.
I need 14 columns and I use "|" as a delimiter, so I applied:
awk -F'|' '{missing=14-NF;if(missing==0){print $0}else{printf "%s",$0;for(i=1;i<=missing-1;i++){printf "|"};print "|"}}' myFile
Suppose I have a row like that:
a|b|c|d|e||f||g||||h|i|
after applying the awk command it will be:
a|b|c|d|e||f||g||||h|i||
and this is not acceptable: I need the data to have exactly 14 columns.
Sample input (a row that already has 14 fields):
a|b|c|d|e||f||g||||h|i
Do nothing
Sample input (a row with extra fields):
a|b|c|d|e||f||g||||h|i|
Output:
a|b|c|d|e||f||g||||h|i
Sample input (a row with too few fields):
a|b|c|d||e||f||g|h
Output:
a|b|c|d||e||f||g|h|||
You may use this gnu-awk solution:
awk -v n=14 '
BEGIN {FS=OFS="|"}
{
$0 = gensub(/^(([^|]*\|){13}[^|]*)\|.*/, "\\1", "1")
for (i=NF+1; i<=n; ++i)
$i = ""
} 1' file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i
a|b|c|d||e||f||g|h|||
Where original file is this:
cat file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i|
a|b|c|d||e||f||g|h
Here:
Using gensub we remove all extra fields (everything after the 14th)
Using the for loop we create new empty fields to make NF = n
If you don't have gnu-awk, then the following should work on non-gnu awk (tested on BSD awk):
awk -v n=14 '
BEGIN {FS=OFS="|"}
{
for (i=NF+1; i<=n; ++i) $i=""
for (i=n+1; i<=NF; ++i) $i=""
NF = n
} 1' file
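If assigning to `NF` is a concern (some awk versions do not rebuild the record when `NF` is decreased), the same result can be sketched by rebuilding each line from exactly `n` fields. The function name `fix_fields` is a hypothetical helper, not part of the answers above.

```shell
# fix_fields N [FILE...]: print every line with exactly N "|"-separated fields
# (hypothetical helper; extra fields are dropped, missing ones are added empty)
fix_fields() {
    n=$1
    shift
    awk -v n="$n" '
        BEGIN { FS = "|" }
        {
            line = $1
            for (i = 2; i <= n; i++)
                line = line FS $i     # $i is empty when i > NF
            print line
        }
    ' "$@"
}
```

Calling `fix_fields 14 file` on the three sample rows should give the same three output lines as the solutions above, because the loop never looks past field 14 and reads missing fields as empty strings.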

How to remove identical columns in a csv file using Bash

There are already a lot of questions like this, but none of them helped me. I want to keep this simple:
I have a file (more than 90 columns) like:
Class,Gene,col3,Class,Gene,col6,Class
A,FF,23,A,FF,16,A
B,GG,45,B,GG,808,B
C,BB,43,C,BB,76,C
I want to keep unique columns so the desired output should be:
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76
I used awk '!a[$0]++' but it did not remove the repeated columns of the file.
As a side note: I have repetitive columns because I used paste command to join different files column-wise.
You may use this awk to print unique columns based on their names in first header row:
awk 'BEGIN {
   FS=OFS=","                  # set input/output field separators as comma
}
NR == 1 {                      # for first header row
   for (i=1; i<=NF; i++)       # loop through all columns
      if (!ucol[$i]++)         # if col name has not been seen before
         hdr[i]                # then store column no. in an array hdr
}
{
   for (i=1; i<=NF; i++)       # loop through all columns
      if (i in hdr)            # if col no. is found in array hdr
         printf "%s",(i==1?"":OFS) $i   # then print col, preceded by OFS except for the first
   print ""                    # print line break
}' file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76
For your specific case where you're just trying to remove 2 cols added by paste per original file all you need is:
$ awk '
BEGIN { FS=OFS="," }
{ r=$1 OFS $2; for (i=3; i<=NF; i+=3) r=r OFS $i; print r }
' file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76
but in other situations where it's not as simple: create an array (f[] below) that maps output field numbers (determined based on uniqueness of first line field/column names) to the input field numbers then loop through just the output field numbers (note: you don't have to loop through all of the input fields, just the ones that you're going to output) printing the value of the corresponding input field number:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==1 {
for (i=1; i<=NF; i++) {
if ( !seen[$i]++ ) {
f[++nf] = i
}
}
}
{
for (i=1; i<=nf; i++) {
printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
}
}
$ awk -f tst.awk file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76
Here's a version with more meaningful variable names and a couple of intermediate variables to clarify what's going on:
BEGIN { FS=OFS="," }
NR==1 {
numInFlds = NF
for (inFldNr=1; inFldNr<=numInFlds; inFldNr++) {
fldName = $inFldNr
if ( !seen[fldName]++ ) {
out2in[++numOutFlds] = inFldNr
}
}
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
inFldNr = out2in[outFldNr]
fldValue = $inFldNr
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
Print the first two columns and then iterate in strides of 3 to skip the Class and Gene columns in the rest of the row.
awk -F, '{printf("%s,%s", $1, $2); for (i=3; i<=NF; i+=3) printf(",%s", $i); printf("\n")}'
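The stride-based one-liners rely on the duplicates repeating in a fixed pattern. As a small check of the header-name approach on a layout where they do not, here is a sketch. The function name `dedup_cols` and the sample header are made up for illustration.

```shell
# dedup_cols [FILE...]: keep only the first column for each distinct header name
dedup_cols() {
    awk '
        BEGIN { FS = OFS = "," }
        NR == 1 {
            for (i = 1; i <= NF; i++)
                if (!seen[$i]++) keep[++nk] = i   # remember input positions to keep
        }
        {
            for (k = 1; k <= nk; k++)
                printf "%s%s", $(keep[k]), (k < nk ? OFS : ORS)
        }
    ' "$@"
}

# Irregular layout: the duplicates are not every third column
printf '%s\n' 'Class,Gene,col3,Gene,col5,Class' 'A,FF,23,FF,16,A' | dedup_cols
```

Here columns 4 and 6 repeat earlier names, so only input columns 1, 2, 3 and 5 are printed. A fixed stride of 3 would pick the wrong columns for this header.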

Regex to replace a character if is unique

I need help with a script that uses a regex to fix a big text file under Linux (with sed, for example). My records look like:
1373350|Doe, John|John|Doe|||B|Acme corp|...
1323350|Simpson, Homer|Homer|Simpson|||3|Moe corp|...
I need to check whether the 7th column holds a single character (a letter or a digit) and, if so, append the second column to it without the comma. I mean:
1373350|Doe, John|John|Doe|||B Doe John|Acme corp|...
1323350|Simpson, Homer|Homer|Simpson|||3 Simpson Homer|Moe corp|...
Any help? Thanks!
Awk is better suited for this job:
awk -F '|' 'BEGIN { OFS = FS } length($7) == 1 { x = $2; sub(/,/, "", x); $7 = $7 " " x } 1' filename
That is:
BEGIN { OFS = FS } # output separated the same way as the input
length($7) == 1 { # if the 7th field is one character long
x = $2 # make a copy of the second field
sub(/,/, "", x) # remove comma from it
$7 = $7 " " x # append it to seventh field
}
1 # print line
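The question asks for the seventh column to hold a single letter or digit, while `length($7) == 1` accepts any single character. A hedged variant, assuming ASCII letters and digits are what is meant, tests the field against a character class instead. The function name `tag_single` is hypothetical.

```shell
# tag_single [FILE...]: append field 2 (commas removed) to field 7
# when field 7 is exactly one ASCII letter or digit
tag_single() {
    awk -F '|' '
        BEGIN { OFS = FS }
        $7 ~ /^[A-Za-z0-9]$/ {
            name = $2
            gsub(/,/, "", name)      # remove every comma, not only the first
            $7 = $7 " " name
        }
        1                            # print the (possibly modified) line
    ' "$@"
}
```

Lines whose seventh field is empty, longer than one character, or a punctuation mark pass through unchanged.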

awk remove unwanted records and consolidate multiline fields to one line record in specific order

I have an output file that I am trying to process into a formatted csv for our audit team.
I thought I had this mastered until I stumbled across bad data within the output. As such, I want to be able to handle this using awk.
MY OUTPUT FILE EXAMPLE
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster

cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj

cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain

cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
THE OUTPUT I WANT AFTER PROCESSING
joe,bloggs,s01234565;uid=bloggsj,cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=E09876543;cn=andy-peters,ou=appserver,ou=components,o=hoster
As you can see:
we always have a cn= variable that contains o=hoster
uid can have any value
we may have multiple cn= variables without o=hoster
I have achieved the following:
cat output | awk '!/^o.*/ && !/^Enter.*/{print}' | awk '{getline a; getline b; getline c; getline d; print $0,a,b,c,d}' | awk -v srch1="cn=" -v repl1="" -v srch2="sn=" -v repl2="" '{ sub(srch1,repl1,$2); sub(srch2,repl2,$3); print $4";"$2" "$3";"$1 }'
Any pointers or guidance on using awk are greatly appreciated. Or should I give up and fall back on the age-old, long-winded method of a large looping script to process the file?
You may try the following awk code:
$ cat file
Enter password ==>
o=hoster
ou=people,o=hoster
ou=components,o=hoster
ou=websphere,ou=components,o=hoster

cn=joe-bloggs,ou=appserver,ou=components,o=hoster
cn=joe
sn=bloggs
cn=S01234565
uid=bloggsj

cn=john-blain,ou=appserver,ou=components,o=hoster
cn=john
uid=blainj
sn=blain

cn=andy-peters,ou=appserver,ou=components,o=hoster
cn=andy
sn=peters
uid=petersa
cn=E09876543
Awk Code :
awk '
function out(){
print s,u,last
i=0; s=""
}
/^cn/,!NF{
++i
last = i == 1 ? $0 : last
s = i>1 && !/uid/ && NF ? s ? s "," $NF : $NF : s
u = /uid/ ? $0 : u
}
i && !NF{
out()
}
END{
out()
}
' FS="=" OFS=";" file
Resulting
joe,bloggs,S01234565;uid=bloggsj;cn=joe-bloggs,ou=appserver,ou=components,o=hoster
john,blain;uid=blainj;cn=john-blain,ou=appserver,ou=components,o=hoster
andy,peters,E09876543;uid=petersa;cn=andy-peters,ou=appserver,ou=components,o=hoster
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
This awk script works for your sample and produces the sample output:
BEGIN { delete cn[0]; OFS = ";" }
function print_info() {
if (length(cn)) {
names = cn[1] "," sn
for (i=2; i <= length(cn); ++i) names = names "," cn[i]
print names, uid, dn
delete cn
}
}
/^cn=/ {
if ($0 ~ /o=hoster/) dn = $0
else {
cn[length(cn)+1] = substr($0, index($0, "=") + 1)
uid = $0; sub("cn", "uid", uid)
}
}
/^sn=/ { sn = substr($0, index($0, "=") + 1) }
/^uid=/ { uid = $0 }
/^$/ { print_info() }
END { print_info() }
This should help you get started.
awk '$1 ~ /^cn/ {
for (i = 2; i <= NF; i++) {
if ($i ~ /^uid/) {
u = $i
continue
}
sub(/^[^=]*=/, x, $i)
r = length(r) ? r OFS $i : $i
}
print r, u, $1
r = u = x
}' OFS=, RS= infile
I assume that there is an error in your sample output: in the third record the uid should be petersa and not E09876543.
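Putting the pieces from the answers together, here is a sketch that emits the `names;uid;dn` layout directly. It assumes, as the use of `!NF`, `/^$/` and `RS=` in the answers implies, that records in the real file are separated by blank lines. The function name `ldap_to_csv` is made up for illustration.

```shell
# ldap_to_csv [FILE...]: one "names;uid;dn" line per blank-line-separated record
# whose first line is a cn=... entry ending in o=hoster
ldap_to_csv() {
    awk '
        BEGIN { RS = ""; FS = "\n"; OFS = ";" }   # paragraph mode: one field per line
        $1 ~ /^cn=.*o=hoster$/ {
            names = ""; uid = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^uid=/) { uid = $i; continue }
                value = $i
                sub(/^[^=]*=/, "", value)         # strip the "cn=" or "sn=" key
                names = (names == "" ? value : names "," value)
            }
            print names, uid, $1
        }
    ' "$@"
}
```

Records that do not start with a `cn=...o=hoster` line, such as the password prompt and the `ou=` entries, are skipped. The names are joined in the order they appear in the record, and the `uid=` line is kept as-is wherever it occurs.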
You might want to look at some of the "already been there and done that" solutions to accomplish the task.
Apache Directory Studio for example, will do the LDAP query and save the file in CSV or XLS format.
-jim

AWK print just the number of a field from an array. R

I am printing a list like this (info[i]):
DP=366
DP=181
DP=254
DP=463
I want to get rid of the DP= prefix and end up with only the number, so that I can process the data afterwards in R.
I obtain the list above with this awk script:
substr($1,1,1) != "#"{
split ($8, info, ";");
num = asort(info);
for ( i=1; i<=num; i++) {
if (info[i] ~ "DP") {
print info[i]
}
}
}
I suppose that a regex would help, but I have no idea how to use one in awk. Thanks in advance!
Try this (I just modified your original code):
substr($1,1,1) != "#"{
split ($8, info, ";");
num = asort(info);
for ( i=1; i<=num; i++) {
if (info[i] ~ "DP") {
sub(/DP=/,"",info[i])
print info[i]
}
}
}
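A variation on the script above, offered as a sketch: `info[i] ~ "DP"` matches any entry that merely contains `DP`, so anchoring the pattern avoids picking up other keys, and dropping `asort` (a gawk extension) lets it run on other awks. The function name `dp_values` and the `DP4` key mentioned in the comment are illustrative assumptions.

```shell
# dp_values [FILE...]: print the DP value from the 8th column of each non-comment line
dp_values() {
    awk '
        substr($1, 1, 1) != "#" {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^DP=/) {          # exact key, so e.g. "DP4=..." is skipped
                    print substr(info[i], 4)     # text after "DP="
                    break
                }
        }
    ' "$@"
}
```

Because the number is printed without its key, the output can be read straight into R, one value per line.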
If you have more columns in the input, you can say:
awk '{sub("[^0-9]*", "", $1)}1' inputfile
In R one could just use:
sub("^.+\\=", "", info)
No need for loop. Only reason to use awk would be if file were too large to fit in memory.
Using awk
awk -F= '{print $2}' file
366
181
254
463